A couple of weeks ago, my business partner pointed me to an interesting article on O'Reilly Radar by Edd Dumbill entitled "The SMAQ stack for big data". If you are not interested in the technical details, skip ahead to the article's very last sentence, which bears repeating:
"The emergence of Linux gave power to the innovative developer with merely a small Linux server at their desk: SMAQ has the same potential to streamline data centers, foster innovation at the edges of an organization, and enable new startups to cheaply create data-driven businesses."
So, what's the SMAQ stack and why is it revolutionary? As the author points out, it's a stack for big data systems, made up of layers of Storage, MapReduce, and Query (SMAQ). Most SMAQ systems seem to be open source, distributed, and running on commodity hardware.
While I did not realize it when we started, we at Dataclip are using the SMAQ stack to launch our data-driven business. It lets us tackle a big data project cost-efficiently with just a few employees. Without the SMAQ stack, I don't think we could have scaled and progressed so quickly to the point where we are now.
Here's a quick look at how Dataclip is using the SMAQ stack:
S: We are using Hadoop on top of Amazon's EC2. Our storage mechanism is the Hadoop Distributed File System, or HDFS.
M: We are using MapReduce as our framework to process the large volume of data that we crawl and index from tens of millions of websites.
A: The "A" in SMAQ does not currently stand for anything, according to the author's coining of the acronym. But you can't have SMAQ without an "A"!
Q: In our case, for Query, we are using Pig, whose scripting language (Pig Latin) is our interface to MapReduce.
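For readers who do want a taste of the technical side, here is a minimal sketch of the MapReduce idea in plain Python. It is not our production code and it does not use Hadoop at all: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample URLs are made up for illustration, and everything runs in memory on one machine. It counts how many crawled pages belong to each domain, the kind of grouping-and-counting job that a Pig script would typically express with a GROUP and a COUNT.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_phase(url):
    """Map step: emit one (domain, 1) pair for a crawled URL."""
    yield urlparse(url).netloc, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all the values for one key into a single count."""
    return key, sum(values)


def run_job(urls):
    """Run map, shuffle, and reduce in sequence on an in-memory list of URLs."""
    pairs = (pair for url in urls for pair in map_phase(url))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read from HDFS instead.
    sample = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/",
    ]
    print(run_job(sample))
```

In a real cluster the map and reduce steps run in parallel on many machines and the shuffle moves data between them over the network; the sketch only shows the shape of the computation.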
Since I am not overly technical, I will stop here, for fear of being accused of talking SMACK about the SMAQ stack! In summary, we agree with Mr. Dumbill: we think these big data processing tools will lead to a revolution in the creation of innovative, data-focused businesses. We look forward to sharing our experiences as we become one such business.
Thursday, September 30th, 2010 at 4:00pm