I decided to attend #hbasecon at the last minute and was pleasantly surprised by the size (1) of the conference and the enthusiasm around the topic. The fact that it was the first formal conference could explain the demand, but judging from the sessions I attended and conversations I had, I do think that a sizable portion of the Hadoop community is coalescing around HBase. There are other distributed technologies (e.g., Cassandra) that fill the need for real-time, atomic, high-performance access, but being built on top of HDFS makes HBase a natural option for organizations already committed to Hadoop (2).

I ran into people from several large technology companies (3) who are doing a variety of interesting things on HBase. Overall a great event. Kudos to Michael Stack and the rest of the program committee. Here are a few observations and highlights from the conference:

  • HBase and HDFS: Past, Present, and Future: In a conference centered around a piece of technology, you need an overview of new and upcoming features. HBase and HDFS committer Todd Lipcon gave a good survey centered around reliability/availability, performance, and features. I particularly appreciated the summary table at the end of his talk.

  • Storing and Manipulating Graphs in HBase: I found the company more intriguing than the talk itself -- but I should point out that Dan Lynn gave a good talk. FullContact has useful data, a simple API, and reasonable pricing. No wonder many companies are leveraging its data to build apps.
  • Mignify: Started by a group of academics, Mignify is a platform for crawling and storing large amounts of web/unstructured data in HBase. I'm definitely going to try it when they launch.
  • Two talks on Schema Design: For all its wonderful features, HBase is unfortunately schema-free. However, engineers have found ways to introduce schemas into applications that rely on HBase. In a highly entertaining talk, Ian Varley (4) of Salesforce went over denormalization using nested entities. Ian also mentioned a couple of tools for loading schemas into HBase: Lars George's hbase-schema-manager, and a soon-to-be open-sourced tool from Salesforce ("scoot").

    Aaron Kimball of WibiData gave a lightning talk on their experience using Avro to create flexible schemas.

  • Lessons learned from OpenTSDB: Originally built for IT monitoring systems, OpenTSDB (time-series database) is built on top of HBase. Prior to this talk, I knew very little about OpenTSDB other than the fact that StumbleUpon uses it heavily ("hundred of thousands of time series and collects over 1 billion data points per day"). As OpenTSDB matures, I can see it being used in many other domains.
  • HBase Filtering: While very specific to HBase, this talk by O'Reilly author Lars George gave me a feel for working with data stored in tables. As Lars' closing summary table shows, choosing the right HBase filter can get complicated!

  • HBase Coprocessors: There was no specific talk on Coprocessors, but they were mentioned in many of the sessions I attended. (Update: In the comments below, Lars George points out he actually gave a talk on Coprocessors.)

    First released in version 0.92.0, Coprocessors improve HBase’s already good scanning performance by pushing computation to the servers, thereby reducing network bottlenecks:

    ... you can use filters to reduce the amount of data being sent over the network from the servers to the client. With the coprocessor feature in HBase, you can even move part of the computation to where the data lives.

    ... A coprocessor enables you to run arbitrary code directly on each region server. More precisely, it executes the code on a per-region basis, giving you trigger-like functionality—similar to stored procedures in the RDBMS world. From the client side, you do not have to take specific actions, as the framework handles the distributed nature transparently.

    ... Use cases for coprocessors are, for instance, using hooks into row mutation operations to maintain secondary indexes, or implementing some kind of referential integrity. Filters could be enhanced to become stateful, and therefore make decisions across row boundaries. Aggregate functions, such as sum(), or avg(), known from RDBMSes and SQL, could be moved to the servers to scan the data locally and only returning the single number result across the network.
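
The difference between the three approaches described in the excerpt above (plain scan, server-side filter, coprocessor-style aggregation) can be sketched with a toy model. This is plain Python, not HBase client code: the `Region` class, its `scan` and `partial_sum` methods, and the three helper functions are hypothetical names invented for illustration, and "items shipped" is only a rough stand-in for network traffic.

```python
class Region:
    """Toy stand-in for one HBase region (hypothetical, not the HBase API)."""

    def __init__(self, rows):
        self.rows = dict(rows)  # row key -> numeric value

    def scan(self, row_filter=None):
        """Return (key, value) pairs in key order; the filter runs 'server-side'."""
        return [(k, v) for k, v in sorted(self.rows.items())
                if row_filter is None or row_filter(k, v)]

    def partial_sum(self, row_filter=None):
        """Coprocessor-style endpoint: aggregate locally and return one number."""
        return sum(v for _, v in self.scan(row_filter))


def client_side_sum(regions, predicate):
    """Ship every row to the client, then filter and sum there."""
    shipped, total = 0, 0
    for region in regions:
        rows = region.scan()
        shipped += len(rows)
        total += sum(v for k, v in rows if predicate(k, v))
    return total, shipped


def filtered_sum(regions, predicate):
    """Let each region filter first, so only matching rows are shipped."""
    shipped, total = 0, 0
    for region in regions:
        rows = region.scan(row_filter=predicate)
        shipped += len(rows)
        total += sum(v for _, v in rows)
    return total, shipped


def coprocessor_sum(regions, predicate):
    """Each region returns a single partial sum; the client adds them up."""
    partials = [region.partial_sum(row_filter=predicate) for region in regions]
    return sum(partials), len(partials)


def is_a_row(key, value):
    """Example predicate: keep rows whose key starts with 'a'."""
    return key.startswith("a")


regions = [
    Region({"a1": 10, "a2": 20, "b1": 5}),
    Region({"a3": 30, "b2": 7, "b3": 8}),
]

print(client_side_sum(regions, is_a_row))  # (60, 6): all six rows shipped
print(filtered_sum(regions, is_a_row))     # (60, 3): only matching rows shipped
print(coprocessor_sum(regions, is_a_row))  # (60, 2): one number per region
```

All three strategies compute the same answer; what changes is how much data has to leave the servers, which is the point the excerpt makes about moving computation to where the data lives.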



    (1) They sold 600 seats, with many people left on a long waiting list. Space constraints forced them to turn people away.

    (2) In fact, I met several attendees who fit this profile: they worked for companies that were already using Hadoop and were considering HBase for some real-time application.

    (3) Including Facebook, Yahoo!, Apple, eBay, Salesforce, Adobe, and Intuit. Facebook created Cassandra but now appears to be using HBase for many key products (including messages).

    (4) Also see Ian's 2009 master's thesis, No Relation: The Mixed Blessings of Non-Relational Databases.
