Monday, February 23, 2015

Faster Pig with Tez (part 2)

After making some changes to VirtualBox that allow my Sandbox to download files from the Internet, I am now returning to the Hortonworks Faster Pig with Tez tutorial. I just downloaded the sample lahman591-csv.zip file using

wget http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip

It worked! So then I unzipped it by doing

unzip lahman591-csv.zip

and then changed to the lahman591-csv directory and verified that the files were properly downloaded.

The next step is to load Batting.csv into Hadoop, using the command

hadoop fs -put ./Batting.csv /user/guest/

That took several seconds. That's way longer than I expected, and at first I thought something was wrong. The file is only about 95,000 lines (per wc Batting.csv). It's probably down to the limited resources of my 8 GB machine running the 4 GB Sandbox virtual machine. I'll keep that in mind and see whether my observations stay consistent with that theory.

Just to do a sanity check on how long these commands might take, I did a quick

hadoop fs -ls /user/guest/

and it also took several seconds. So then I did

time hadoop fs -ls /user/guest

to track the time exactly and it took 5.24 sec! Ok, so that's sort of a benchmark for me to work from. Let's move on.

I'm about to run my first Pig script ever. The tutorial tells me to open a file called 1.pig and paste in 

batting = LOAD '/user/guest/Batting.csv' USING PigStorage(','); -- load comma-separated fields
raw_runs = FILTER batting BY $1>0; -- keep rows with a positive year (drops the header row)
runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs; -- project just the columns we need
grp_data = GROUP runs BY (year); -- one group per year
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs; -- top run total per year
join_max_runs = JOIN max_runs BY ($0, max_runs), runs BY (year, runs); -- match each max back to its player
join_data = FOREACH join_max_runs GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DUMP join_data;

Already, I like the look of Pig. I'm quite used to SQL (it's roughly my 10th language), both its awesome parts and its quirks. I really like the flow of this Pig script: I had heard Pig reads very linearly, and that's a nice improvement over how opaque SQL can be in places.

Then it says to run the Pig script using the MapReduce engine by typing

pig 1.pig

I ran it, and when it looked like it was going to take a while, I killed it and ran

time pig 1.pig

It happily (if slowly) obeyed. The run took 1m 34.9s (real) and 0m 19.0s (user).

Ah, and the tutorial shows that it took their single-node "pseudocluster" 2m 11s to run. Now I don't feel so bad. In my experience, a Hadoop/Hive environment shouldn't take quite this long, but I may have been spoiled by super sweet high-end hardware.

And now we move to the last portion of the tutorial: running Pig on Tez. It took less than half the time on their system, and about half the time on my system as well (0m 45s).
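For the record, here's the invocation for that step. Pig's -x flag selects the execution engine, and the Sandbox ships with Tez support; I'm assuming the same 1.pig script and working directory as above:

```shell
# Same script, but on the Tez execution engine instead of MapReduce.
# -x (exectype) switches Pig's engine; "tez" requires Tez on the cluster,
# which the Hortonworks Sandbox includes.
time pig -x tez 1.pig
```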

And that's without any optimization; it's exciting to think there may be further speed increases to squeeze out.

Pig on Tez, over and out.
