Playing with Yahoo Pig

Over the last few weeks I’ve been playing with Hadoop and Pig.

Hadoop is a framework for distributed computing that uses the MapReduce programming model, and Pig is a domain-specific language built on top of Hadoop for running ad-hoc queries over very large datasets. MapReduce was introduced and implemented by Google, which also created Sawzall, another language for data analysis across clusters. Hadoop and Pig are free software projects supported by Yahoo.
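To make the MapReduce model a bit more concrete, here is a minimal single-process sketch in Python of the classic word count. It is not Hadoop's API and nothing here is distributed; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. It only shows the shape of the model: a map step emits key/value pairs, the framework groups them by key, and a reduce step collapses each group.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the job a real framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["pig runs on hadoop", "hadoop runs mapreduce"]))
```

In a real cluster the map and reduce calls run on different machines and the grouping step moves data over the network; a language like Pig exists so that this kind of job can be written as a short query instead of hand-written map and reduce functions.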

Pig is an interesting project, but it hasn’t been very active for a long time. It was recently proposed for the Apache Incubator, so I expect it will gain more visibility and improvements from the community. In the meantime, I took a look at the code and made some small changes:

  • The Grunt console now has basic completion functionality using jline.
  • It’s compiled for a recent Hadoop release. I tried to use 0.14.2, but I got errors when executing the MapReduce jobs, so I had to switch to 0.13.1.
  • I removed the support for executing Hadoop from Pig. I think it’s easier to configure a separate Hadoop instance and keep the configuration files outside the main jar file.

The code is distributed under the BSD license (the same one used by Pig) and it’s published in my personal Mercurial repository.