Thursday, August 4, 2011

Large-scale Named Entity Recognition in the Cloud

Below is an amazing factoid shared by Carlos Guestrin during his GraphLab overview:


CoEM is an entity-extraction algorithm introduced by Rosie Jones in 2005. In the slide above, the data set was a graph with 2M vertices and 200M edges. A team at CMU applied CoEM using Hadoop on 95 cores and found it took about 7.5 hours. GraphLab on 16 cores took 30 minutes: roughly 6X fewer cores, yet 15X faster. The same problem on 32 EC2 machines (with 256 processors) took 80 seconds, or 0.3% of the time it took to complete the same task in Hadoop!
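For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the ratios from the runtimes and core counts quoted for the slide. The variable and function names are my own, and the figures are just the rounded ones reported here, not new measurements.

```python
# Figures as quoted in this post (rounded values from the slide).
HADOOP_SECONDS = 7.5 * 3600        # about 7.5 hours on 95 cores
GRAPHLAB_16_SECONDS = 30 * 60      # 30 minutes on 16 cores
GRAPHLAB_EC2_SECONDS = 80          # 80 seconds on 32 EC2 machines


def speedup(baseline_seconds, new_seconds):
    """Return how many times faster the new runtime is than the baseline."""
    return baseline_seconds / new_seconds


core_ratio = 95 / 16
multicore_speedup = speedup(HADOOP_SECONDS, GRAPHLAB_16_SECONDS)
ec2_speedup = speedup(HADOOP_SECONDS, GRAPHLAB_EC2_SECONDS)
ec2_fraction = GRAPHLAB_EC2_SECONDS / HADOOP_SECONDS

print(f"Core reduction, Hadoop vs. GraphLab multicore: {core_ratio:.1f}x")
print(f"Speedup, GraphLab on 16 cores vs. Hadoop: {multicore_speedup:.0f}x")
print(f"Speedup, GraphLab on EC2 vs. Hadoop: {ec2_speedup:.1f}x")
print(f"EC2 runtime as a share of Hadoop runtime: {ec2_fraction:.1%}")
```

Note that this compares wall-clock time only; it does not normalize for the different number of cores used in each run.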

If you're interested in large-scale machine learning, do watch Carlos' GraphLab overview. What's intriguing is that the GraphLab team has put together a precompiled GraphLab EC2 AMI, making it much easier to get the GraphLab system up and running on EC2. Carlos describes many more experiments that they conducted on EC2: the total cost of running many large, compute-intensive experiments was a mere $4,000. Compare that to the cost of purchasing and maintaining a similar cluster on your own.

Update: See the comment below from Aapo Kyrola. It turns out the multicore version of CoEM on GraphLab was even more impressive: instead of 30 minutes, it actually runs in 2-3 minutes on 16 cores, and in maybe 1 minute on EC2, for three full iterations.

Related posts:
  • Text Mining and Twitter III- LDA Code on Hadoop

  • Some thoughts on (natural language) Search

  • Compressed Sensing and Big Data

1 comment:

    Aapo Kyrola said...

    Hi, I actually implemented CoEM for GraphLab. The initial multicore version of CoEM had an unfortunate performance problem (I stupidly recomputed the tf-idf factor for each category, although it was the same for all). Instead of 30 mins, it actually runs in 2-3 mins on 16 cores. On EC2, in maybe 1 min. For three full iterations.