wget http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip
It worked! So then I unzipped it by doing
unzip lahman591-csv.zip
and then changed to the lahman591-csv directory and verified that the files were properly downloaded. Next I copied Batting.csv into HDFS:
hadoop fs -put ./Batting.csv /user/guest/
That took several seconds, which is way longer than I expected, and at first I thought something was wrong. The file is only about 95k lines (per wc Batting.csv). The slowness is probably due to my 8 GB machine running the 4 GB Sandbox virtual machine. I'll keep that in mind and see whether my later observations stay consistent with that line of thought.
Just to do a sanity check on how long these commands might take, I did a quick
hadoop fs -ls /user/guest/
and it also took several seconds. So then I did
time hadoop fs -ls /user/guest
to track the time exactly and it took 5.24 sec! Ok, so that's sort of a benchmark for me to work from. Let's move on.
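A single timed run can be noisy, so for a steadier benchmark the same command could be repeated a few times. Here is a minimal sketch, assuming a POSIX shell: bench is a made-up helper name (not part of Hadoop or the tutorial), and date +%s only resolves whole seconds, so it is coarser than time.

```shell
# bench N CMD [ARGS...]: run CMD N times and print each wall-clock time.
# "bench" is a hypothetical helper, not a Hadoop or tutorial command.
bench() {
  runs=$1; shift
  i=1
  while [ "$i" -le "$runs" ]; do
    start=$(date +%s)          # whole seconds since the epoch
    "$@" > /dev/null           # discard the command's normal output
    end=$(date +%s)
    echo "run $i: $((end - start))s"
    i=$((i + 1))
  done
}

# On the Sandbox this could be called as:
#   bench 3 hadoop fs -ls /user/guest
```

If the times stay around five seconds on every run, the delay is a fixed per-command cost rather than a one-off hiccup.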
I'm about to run my first Pig script ever. The tutorial tells me to open a file called 1.pig and paste in
batting = LOAD '/user/guest/Batting.csv' USING PigStorage(',');
raw_runs = FILTER batting BY $1>0;
runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) AS max_runs;
join_max_runs = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_runs GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DUMP join_data;
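To see what each step of the script is doing, the same logic can be sketched locally with awk on a handful of rows. This is only an illustration under assumptions: the sample rows are invented, the header names other than the first, second and ninth columns are assumed, and max_runs_per_year is a hypothetical function name. Pig's $0, $1 and $8 are zero-based, so they correspond to awk's $1, $2 and $9.

```shell
# Made-up sample rows; only columns 1 (playerID), 2 (year) and 9 (runs) matter.
sample='playerID,yearID,stint,teamID,lgID,G,G_batting,AB,R
aaa01,2001,1,XXX,NL,100,100,400,50
bbb01,2001,1,YYY,AL,120,120,450,90
ccc01,2002,1,XXX,NL,110,110,420,70
ddd01,2002,1,ZZZ,AL,130,130,500,70'

# Reads CSV on stdin and prints "year,playerID,runs" for each top scorer per year.
max_runs_per_year() {
  awk -F',' '
    $2 + 0 > 0 {                 # like FILTER batting BY $1 > 0 (skips the header row)
      n++; player[n] = $1; year[n] = $2; runs[n] = $9 + 0
      # like GROUP runs BY year followed by MAX(runs.runs)
      if (!($2 in best) || runs[n] > best[$2]) best[$2] = runs[n]
    }
    END {
      # like the JOIN on (year, runs): keep rows whose runs equal the yearly max
      for (i = 1; i <= n; i++)
        if (runs[i] == best[year[i]]) print year[i] "," player[i] "," runs[i]
    }' | LC_ALL=C sort
}

result=$(printf '%s\n' "$sample" | max_runs_per_year)
printf '%s\n' "$result"
```

Note that the 2002 tie yields two output rows, which is also what joining on (year, runs) does: every player who matches the yearly maximum is kept.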
Already, I like the look of Pig. I'm quite used to SQL (it's approximately my 10th language) -- both its awesome stuff and its quirks. I really like the flow of this Pig script. I had heard that Pig is very linear, and that's a nice improvement over how obscure SQL can be in some respects.
Then it says to run the Pig script using the MapReduce engine by typing
pig 1.pig
I ran it, and it looked like it was going to take a while, so I killed it and ran
time pig 1.pig
It happily (if slowly) obeyed. The time to run was 1m 34.9s (real) and 0m 19.0s (user).
Ah, and the tutorial shows that it took their single-node "pseudocluster" 2m 11s to run. Now I don't feel so bad. My experience in a Hadoop/Hive environment is that it shouldn't take quite this long, but I may have been spoiled on super sweet high-end hardware.
And now we move to the last portion of the tutorial: Running Pig on Tez. It took less than half the time on their system, and it took about half the time on my system as well (0m 45s).
This is without any optimization, so it's very cool to see that kind of speed increase.
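For the record, the speedup on my numbers can be worked out directly from the two wall-clock times above (1m 34.9s on MapReduce, 0m 45s on Tez). A minimal sketch, using awk for the floating-point division; the variable names are just for illustration:

```shell
mr_seconds=94.9    # 1m 34.9s, the MapReduce run (real time)
tez_seconds=45     # 0m 45s, the Tez run

# Ratio of MapReduce time to Tez time, rounded to two decimals.
speedup=$(awk -v mr="$mr_seconds" -v tez="$tez_seconds" 'BEGIN { printf "%.2f", mr / tez }')
echo "Tez speedup: ${speedup}x"
```

That works out to roughly a 2.1x speedup, which matches the "about half the time" impression.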
Pig on Tez, over and out.