CoEM is an entity-extraction algorithm introduced by Rosie Jones in 2005. In the slide above, the data set was a graph with 2M vertices and 200M edges. A team at CMU ran CoEM on Hadoop using 95 cores and found it took about 7.5 hours. GraphLab on 16 cores took 30 minutes: 6X fewer cores, yet 15X faster. The same problem on 32 EC2 machines (with 256 processors) took 80 seconds, or 0.3% of the time the same task took on Hadoop!
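As a quick sanity check on those ratios, here is a small Python sketch that only divides the runtimes and core counts quoted above. The variable and function names are my own, chosen for illustration. Dividing 7.5 hours by 30 minutes gives a 15x speedup for the 16-core run, and 80 seconds works out to about 0.3% of the Hadoop time.

```python
# Back-of-the-envelope check of the CoEM figures quoted in the post.
# All names below are illustrative; the numbers come from the text.

HADOOP_SECONDS = 7.5 * 3600        # about 7.5 hours on 95 cores
GRAPHLAB_16_CORE_SECONDS = 30 * 60 # 30 minutes on 16 cores
GRAPHLAB_EC2_SECONDS = 80          # 80 seconds on 32 EC2 machines

HADOOP_CORES = 95
GRAPHLAB_CORES = 16


def speedup(baseline_seconds, new_seconds):
    """Return how many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


# Speedup of the 16-core GraphLab run over Hadoop.
multicore_speedup = speedup(HADOOP_SECONDS, GRAPHLAB_16_CORE_SECONDS)

# How many times fewer cores GraphLab used.
core_ratio = HADOOP_CORES / GRAPHLAB_CORES

# EC2 run expressed as a fraction of the Hadoop runtime, and as a speedup.
ec2_fraction = GRAPHLAB_EC2_SECONDS / HADOOP_SECONDS
ec2_speedup = speedup(HADOOP_SECONDS, GRAPHLAB_EC2_SECONDS)

print(f"16-core speedup over Hadoop: {multicore_speedup:.1f}x")
print(f"Core reduction: {core_ratio:.1f}x fewer cores")
print(f"EC2 runtime as share of Hadoop: {ec2_fraction:.2%}")
print(f"EC2 speedup over Hadoop: {ec2_speedup:.1f}x")
```

Note that this compares wall-clock time only; it says nothing about per-core efficiency or cost, which would need the machine specs.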
If you're interested in large-scale machine learning, do watch Carlos' GraphLab overview. What's intriguing is that the GraphLab team has put together a precompiled GraphLab EC2 AMI, making it much easier to get the GraphLab system up and running on EC2. Carlos describes many more experiments that they conducted on EC2: the total cost of running many large, compute-intensive experiments was a mere $4,000. Compare that to the cost of purchasing and maintaining a similar cluster on your own.
Update: See the comment below from Aapo Kyrola. It turns out the multicore version of CoEM on GraphLab was even more impressive: instead of 30 minutes, it actually runs in 2-3 minutes on 16 cores, and in maybe 1 minute on EC2, for three full iterations.
Related posts:
Text Mining and Twitter III- LDA Code on Hadoop
Some thoughts on (natural language) Search
Compressed Sensing and Big Data

1 comment:
Hi, I actually implemented CoEM for GraphLab. The initial multicore version of CoEM had an unfortunate performance problem (I stupidly recomputed the tf-idf factor for each category, although it was the same for all). Instead of 30 mins, it actually runs in 2-3 mins on 16 cores, and in maybe 1 min on EC2, for three full iterations.