Channel: Leadership Experience from Trenches - Big Data

Introducing Pig


Last week, I ordered Programming Pig from Amazon and started learning. I figured out that it is much easier to run a few command lines than to write full-fledged MapReduce programs when I am feeling a little lazy. After all, it won’t hurt to pick up something in parallel while mastering Hadoop. Why not, it is all about parallel processing, isn’t it? :-)

 

Pig Ecosystem

1. Pig Latin – The language you use to write scripts that are executed by the Pig engine.

2. Grunt – A console UI to run Pig commands or scripts written in Pig Latin. The commands/scripts are ultimately run by the Pig engine.

3. Pig Engine – The core of Pig. It takes Pig Latin scripts or commands as input and converts them into MapReduce programs. This lets a researcher or data scientist sitting in a lab experiment with sample data before writing a full-blown MapReduce program or doing an actual run on terabytes of data.
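To get a feel for what a Pig Latin script looks like, here is a minimal sketch. The file name `users.csv` and its columns are made up for illustration; you could type these lines directly into Grunt:

```pig
-- Load a comma-separated file; the schema is declared inline.
users = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);

-- Keep only the rows we care about.
adults = FILTER users BY age >= 18;

-- Print the result to the console (handy in Grunt while experimenting).
DUMP adults;
```

The Pig engine turns these three statements into one or more MapReduce jobs behind the scenes.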


Pig is an engine (plus a few accessories such as Grunt) for executing data flows in parallel on Hadoop. Pig Latin is used to express these data flows. Behind the scenes, the Pig engine uses HDFS and MapReduce.

In the coming days, I will take each of the use cases below and write scripts to play around with:

1. Traditional ETL data pipeline

2. Research on raw data

3. Iterative processing
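As a preview of the first use case, a traditional ETL pipeline in Pig Latin typically loads raw data, cleans and aggregates it, and stores the result back into HDFS. A rough sketch, where all file names and fields are hypothetical:

```pig
-- Extract: load raw web logs (tab-separated by default).
logs = LOAD 'raw_logs' AS (user:chararray, url:chararray, bytes:long);

-- Transform: drop malformed rows, then aggregate bytes per user.
clean = FILTER logs BY user IS NOT NULL;
grouped = GROUP clean BY user;
usage = FOREACH grouped GENERATE group AS user, SUM(clean.bytes) AS total_bytes;

-- Load: write the summarized data back to HDFS for downstream use.
STORE usage INTO 'user_usage';
```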

Like MapReduce, Pig is well suited to batch processing of gigabytes or terabytes of data.

Pig Philosophy

1. Pigs eat anything – with or without metadata, Pig operates on the data.

2. Pigs live anywhere – Hadoop today, but potentially any other parallel data processing framework.

3. Pigs are domestic animals – Pig supports user-defined functions and the ability to turn the optimizer on or off.

4. Pigs fly – quick data processing (performance).
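Point 3 above refers to user-defined functions (UDFs), which let you extend Pig Latin with your own code. Invoking one looks roughly like this; the jar name `myudfs.jar` and the `UPPER` class are illustrative, assuming you have written and packaged such a UDF in Java:

```pig
-- Make the jar containing the UDF visible to Pig.
REGISTER myudfs.jar;

-- Apply the (hypothetical) UPPER UDF to a field.
users = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);
upper_names = FOREACH users GENERATE myudfs.UPPER(name);
```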

In the next few blogs, I will explore Pig’s data model, Pig Latin, and Grunt. Stay tuned.

Happy Learning!!!

