I ran into people from several large technology companies (3) who are doing a variety of interesting things on HBase. Overall a great event. Kudos to Michael Stack and the rest of the program committee. Here are a few observations and highlights from the conference:
HBase and HDFS: Past, Present, and Future: In a conference centered around a piece of technology, you need an overview of new and upcoming features. HBase and HDFS committer Todd Lipcon gave a good survey centered around reliability/availability, performance, and features. I particularly appreciated the summary table at the end of his talk.
Storing and Manipulating Graphs in HBase: I found the company more intriguing than the talk itself, though I should point out that Dan Lynn gave a good talk. FullContact has useful data, a simple API, and reasonable pricing. No wonder many companies are leveraging their data to build apps.
Mignify: Started by a group of academics, Mignify is a platform for crawling and storing large amounts of web/unstructured data in HBase. I'm definitely going to try it when they launch.
Two talks on Schema Design: For all its wonderful features, unfortunately HBase is schema-free. However, engineers have found ways to introduce schemas into applications that rely on HBase. In a highly entertaining talk, Ian Varley (4) of Salesforce went over denormalization using nested entities. Ian also mentioned a couple of tools for loading schemas into HBase: Lars George's hbase-schema-manager, and a soon-to-be-open-sourced tool from Salesforce ("scoot"). Aaron Kimball of Wibidata gave a lightning talk on their experience using Avro to create flexible schemas.
Lessons learned from OpenTSDB: Originally designed for IT monitoring systems, OpenTSDB (a time-series database) runs on top of HBase. Prior to this talk, I knew very little about OpenTSDB other than the fact that StumbleUpon uses it heavily ("hundred of thousands of time series and collects over 1 billion data points per day"). As OpenTSDB matures, I can see it being used in many other domains.
HBase Filtering: While very specific to HBase, this talk by O'Reilly author Lars George gave me a feel for working with data stored in tables. As Lars' closing summary table shows, choosing the right HBase filter can get complicated!
HBase Coprocessors: There was no specific talk on Coprocessors, but they were mentioned in many of the sessions I attended. (Update: In the comments below, Lars George points out he actually gave a talk on Coprocessors.) First available in version 0.91.0, Coprocessors improve HBase’s already good scanning performance by pushing computations up to the server, thereby reducing network bottlenecks:
.. you can use filters to reduce the amount of data being sent over the network from the servers to the client. With the coprocessor feature in HBase, you can even move part of the computation to where the data lives.... A coprocessor enables you to run arbitrary code directly on each region server. More precisely, it executes the code on a per-region basis, giving you trigger-like functionality—similar to stored procedures in the RDBMS world. From the client side, you do not have to take specific actions, as the framework handles the distributed nature transparently.
... Use cases for coprocessors are, for instance, using hooks into row mutation operations to maintain secondary indexes, or implementing some kind of referential integrity. Filters could be enhanced to become stateful, and therefore make decisions across row boundaries. Aggregate functions, such as sum(), or avg(), known from RDBMSes and SQL, could be moved to the servers to scan the data locally and only returning the single number result across the network.
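To make the aggregation use case in the quote concrete, here is a minimal sketch in plain Python. It does not use the HBase client or the coprocessor API; the `REGIONS` dictionary and both function names are hypothetical stand-ins, and the "values sent" counters are only a rough proxy for network traffic. The point is simply that summing on each region and shipping back one number per region moves far less data than scanning every value to the client.

```python
# Hypothetical stand-in for a table split into regions. Each region holds
# the rows for one key range. This is an illustration, NOT the HBase API.
REGIONS = {
    "region-a": {"row-001": 10, "row-002": 20, "row-003": 30},
    "region-b": {"row-101": 5, "row-102": 15},
    "region-c": {"row-201": 40},
}


def client_side_sum(regions):
    """Scan every value back to the client and add them up there.

    Returns (total, values_sent), where values_sent counts how many
    values would have crossed the network.
    """
    total = 0
    values_sent = 0
    for rows in regions.values():
        for value in rows.values():
            values_sent += 1  # each cell value travels to the client
            total += value
    return total, values_sent


def region_side_sum(regions):
    """Coprocessor-style: each region sums locally, the client merges.

    Returns (total, values_sent), where values_sent is one partial
    result per region.
    """
    partials = [sum(rows.values()) for rows in regions.values()]
    return sum(partials), len(partials)


if __name__ == "__main__":
    print(client_side_sum(REGIONS))
    print(region_side_sum(REGIONS))
```

Both functions return the same total, but the region-side version sends one value per region rather than one per row. With three tiny regions the gap is small; with millions of rows per region it would be the difference the quote describes.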
(1) They sold 600 seats, with many people left on a long waiting list. Space constraints forced them to turn people away.
(2) In fact I met several attendees who fit this profile: they worked for companies that were already using Hadoop and were considering HBase for some real-time application.
(3) Including Facebook, Yahoo!, Apple, eBay, Salesforce, Adobe, Intuit. Facebook created Cassandra but now appears to be using HBase for many key products (including messages).
(4) Also see Ian's 2009 master's thesis, No Relation: The Mixed Blessings of Non-Relational Databases.