Monday, February 23, 2015

Faster Pig with Tez (part 2)

After making some changes to VirtualBox, which allow my Sandbox to download files from the Internet, I am now returning to the Hortonworks Faster Pig with Tez tutorial. I have just downloaded the sample lahman591-csv.zip file using

wget http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip

It worked! So then I unzipped it by doing

unzip lahman591-csv.zip

and then changed to the lahman591-csv directory and verified that the files were properly downloaded.

The next step is to load Batting.csv into hadoop, by using the command

hadoop fs -put ./Batting.csv /user/guest/

That took several seconds, way longer than I thought it should, and at first I thought something was wrong. The file is only about 95,000 lines (per wc Batting.csv). It's probably due to my 8 GB machine hosting the 4 GB Sandbox virtual machine. I'll keep that in mind and see whether my observations stay consistent with that theory.

Just to do a sanity check on how long these commands might take, I did a quick

hadoop fs -ls /user/guest/

and it also took several seconds. So then I did

time hadoop fs -ls /user/guest

to track the time exactly and it took 5.24 sec! Ok, so that's sort of a benchmark for me to work from. Let's move on.

I'm about to run my first Pig script ever. The tutorial tells me to open a file called 1.pig and paste in 

batting = LOAD '/user/guest/Batting.csv' USING PigStorage(',');
raw_runs = FILTER batting BY $1>0; 
runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs; 
grp_data = GROUP runs BY (year); 
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs; 
join_max_runs = JOIN max_runs BY ($0, max_runs), runs BY (year, runs); 
join_data = FOREACH join_max_runs GENERATE $0 AS year, $2 AS playerID, $1 AS runs; 
DUMP join_data;

Already, I like the look of Pig. I'm quite used to SQL (it's approximately my 10th language) -- both its awesome parts and its quirks. I really like the flow of this Pig script. I'd heard it was very linear, and that's a nice improvement over how opaque SQL can be in places.
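For intuition, here's roughly what that chain of relations boils down to, sketched with plain awk against a tiny invented sample (the file contents and zero-padding are mine, not the Lahman data). The field positions mirror the script: awk's $1 is Pig's $0 (playerID), $2 is $1 (year), and $9 is $8 (runs).

```shell
# Tiny made-up sample standing in for Batting.csv.
cat > sample.csv <<'EOF'
ruthba01,1927,0,0,0,0,0,0,158
gehrilo01,1927,0,0,0,0,0,0,149
mayswi01,1955,0,0,0,0,0,0,123
aaronha01,1955,0,0,0,0,0,0,105
EOF

# For each year, keep the highest run total and the player who hit it
# (the equivalent of the GROUP / MAX / JOIN chain in the Pig script),
# then sort for stable output.
awk -F, '$9 > best[$2] { best[$2] = $9; who[$2] = $1 }
         END { for (y in who) print y "," who[y] "," best[y] }' sample.csv | sort
# prints:
# 1927,ruthba01,158
# 1955,mayswi01,123
```

Of course, the whole point of Pig is that the real query runs distributed over HDFS; this is just a local mental model of the data flow.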

Then it says to run the Pig script using the MapReduce engine by typing

pig 1.pig

I ran it and it looked like it was going to take a while so I killed it and ran

time pig 1.pig

It happily (if slowly) obeyed. The run took 1m 34.9s (real) and 0m 19.0s (user).

Ah, and the tutorial shows that it took their single-node "pseudocluster" 2m 11s to run. Now I don't feel so bad. My experience in a Hadoop/Hive environment is that it shouldn't take quite this long, but I may have been spoiled on super sweet high-end hardware.

And now we move to the last portion of the tutorial: running Pig on Tez. It took less than half the time on their system, and about half the time on mine as well (0m 45s).

And that's without any optimization. It's very cool to think there may be further speed increases to squeeze out.
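For the record, switching engines comes down to Pig's execution-mode flag; I'm writing this from my understanding of the Pig CLI rather than quoting the tutorial text:

```shell
# Same script, but on the Tez engine instead of MapReduce;
# -x selects Pig's execution mode.
time pig -x tez 1.pig
```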

Pig on Tez, over and out.

Friday, February 20, 2015

Faster Pig with Tez

I'm starting the Hortonworks tutorial called Faster Pig with Tez.

I am interested in learning how Pig works and also what Tez can do. It was really nice to see a definition for "tez" (it's Hindi for "speed"). I've often wondered about various tool names, why they were given their names and what they mean. Ok, then. Moving on.

So I started up my Sandbox and connected to it from a terminal via

ssh root@127.0.0.1 -p 2222 

The tutorial tells me to start with a baseball data set that can be downloaded using the command

wget http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip

However, my Sandbox VM could not resolve the host. That's because I hadn't set up VirtualBox's network adapters properly. I shut down the Sandbox, opened Preferences | Network | Host-only Networks and then clicked on the +Adapter button (right side of menu). That added 'vboxnet0' to the (previously empty) list. I went back to the Sandbox settings and added Adapter 2. I changed the 'Attached to' to Host-only Adapter and selected 'Name' to be 'vboxnet0'.

Then I started up the Sandbox again and tried to run the wget command (above), but it still could not resolve the host. So I checked /etc/resolv.conf in the Sandbox and it showed

nameserver 8.8.8.8

That's the Google DNS, but for some reason it wasn't working for me. I tried to ping 8.8.8.8 from the Sandbox and it worked just fine, but it could not ping www.google.com. What?!

Then I did

nslookup hortonassets.s3.amazonaws.com

on my computer and got an IP address of 54.231.2.49, but the Sandbox could not do

wget 54.231.2.49/pig/lahman591-csv.zip

Weird. So I changed /etc/resolv.conf to look at my local network's DNS (which was what my computer's /etc/resolv.conf was pointing at).
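Concretely, the edit was just swapping out the nameserver line; the address below is a placeholder, since yours will be whatever your host's /etc/resolv.conf lists:

```
# /etc/resolv.conf inside the Sandbox
nameserver 192.168.1.1   # placeholder -- use your local network's DNS
```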

Then I tried to wget again and...

Success!

Good, because that took a while to think up all that stuff (and more). I even tried to download the csv onto my laptop and scp it over to the VM with

scp ~/Downloads/lahman591-csv.zip root@127.0.0.1:~

but I got an error

ssh: connect to host 127.0.0.1 port 22: Connection refused

I didn't fully work out those problems. (Though I did see that the Sandbox's Network settings has a Port Forward button that shows that the ssh Host Port is 2222 -- like how I ssh'd into the Sandbox -- but the Guest Port is 22. There's a resolution in there somewhere. But I digress...)
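Looking at it now, I believe the resolution is simply to point scp at the forwarded host port; note that scp takes a capital -P for the port, unlike ssh's lowercase -p:

```shell
# Copy to the Sandbox through the forwarded port (host 2222 -> guest 22).
# scp uses -P (capital) for the port, while ssh uses -p (lowercase).
scp -P 2222 ~/Downloads/lahman591-csv.zip root@127.0.0.1:~
```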

So now I need to return to the first step in the Faster Pig with Tez tutorial...

Friday, February 13, 2015

Recommendation Engines

I'm doing some light reading about creating/coding a recommendation engine. Here are the papers that I'm starting with:

Large-scale Parallel Collaborative Filtering for the Netflix Prize (2008)

Matrix Factorization Techniques for Recommender Systems (2009)

Recommender Systems (2010)

Hortonworks 2.2 (Sandbox)

I'm going to document some of my journey into the world of Hortonworks. I have worked in a Hadoop/Hive environment before, and I am really impressed with the new tools that have been added to that Apache stack.

I've just decided to download the Hortonworks Data Platform (HDP) version 2.2 (released Dec 2014) and play around with it. I have been planning on blogging about my experiences as a data scientist so I thought this might be a good time to start a blog and keep track of ideas and progress.

I downloaded the HDP 2.2 virtual environment for VirtualBox (VMWare and HyperV are also available) from

http://hortonworks.com/hdp/downloads/

And here's the install guide.

The image is 4.7 GB, about twice the size of the previous HDP 2.1. I'm not sure why it is so much larger; only 4 new tools have been added (not even close to a doubling).

After opening the VM, it shows instructions for two ways to interact with the Sandbox: through the browser

http://127.0.0.1:8888

and ssh

ssh root@127.0.0.1 -p 2222

I opened the browser (http://127.0.0.1:8888) and found that the ssh password is hadoop. I successfully logged in through ssh. I also followed the new link in the "Dive right in" section, which took me to the Sandbox landing page. The login is hue and the password is 1111, but I appear to be logged in upon arrival.

Now I'll start looking through the tutorials on the Hadoop for Data Scientists & Analysts page.