Friday, February 20, 2015

Faster Pig with Tez

I'm starting this HDS tutorial called Faster Pig with Tez.

I am interested in learning how Pig works and what Tez can do. It was really nice to see a definition for "tez" (it's Hindi for "speed"). I've often wondered about tool names: why they were chosen and what they mean. OK, then. Moving on.

So I started up my Sandbox and connected to it from a terminal via

ssh root@127.0.0.1 -p 2222 

The tutorial tells me to start with a baseball data set that can be downloaded using the command

wget http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip

However, my Sandbox VM could not resolve the host. My first guess was that I hadn't set up VirtualBox's network adapters properly. I shut down the Sandbox, opened Preferences | Network | Host-only Networks and clicked the +Adapter button (on the right side of the menu). That added 'vboxnet0' to the (previously empty) list. I went back to the Sandbox settings and enabled Adapter 2. I changed 'Attached to' to Host-only Adapter and set 'Name' to 'vboxnet0'.
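The same adapter setup can probably be done from the command line with VBoxManage instead of the GUI. This is a minimal sketch, untested against this Sandbox: `add_hostonly_adapter` is a hypothetical helper name, the VM name is whatever VirtualBox lists for your Sandbox, and it assumes the newly created interface is named vboxnet0 (which is what happens when no host-only network existed before). The VM has to be powered off first.

```shell
# Create a host-only network and attach it to a VM as Adapter 2.
# $1 is the VM name as shown by `VBoxManage list vms`.
add_hostonly_adapter() {
    vm_name=$1
    # Creates the next free host-only interface (vboxnet0 if none exist yet).
    VBoxManage hostonlyif create
    # Equivalent of setting Adapter 2 to "Host-only Adapter" with Name "vboxnet0".
    VBoxManage modifyvm "$vm_name" --nic2 hostonly --hostonlyadapter2 vboxnet0
}
```

Usage would look like `add_hostonly_adapter "My Sandbox VM"`, with the real VM name substituted.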

Then I started up the Sandbox again and tried to run the wget command (above), but it still could not resolve the host. So I checked /etc/resolv.conf in the Sandbox and it showed

nameserver 8.8.8.8

That's Google's public DNS server, but for some reason it wasn't working for me. I could ping 8.8.8.8 from the Sandbox just fine, but I could not ping www.google.com. What?!

Then I did

nslookup hortonassets.s3.amazonaws.com

on my computer and got an IP address of 54.231.2.49, but the Sandbox could not do

wget 54.231.2.49/pig/lahman591-csv.zip

Weird. So I changed /etc/resolv.conf in the Sandbox to point at my local network's DNS server (the same one my computer's /etc/resolv.conf was pointing at).
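To see which DNS servers a machine is configured to use, and whether each one answers, something like the following could help. It is a rough sketch: `list_nameservers` and `check_nameservers` are hypothetical helper names, and it assumes `awk` and `nslookup` are installed wherever it runs.

```shell
# Print the nameserver addresses from a resolv.conf-style file.
# Defaults to /etc/resolv.conf; the optional argument makes it easy to try on a copy.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Query each configured nameserver for a hostname and show the raw nslookup output,
# so it is obvious which server (if any) fails to answer.
check_nameservers() {
    host_to_resolve=$1
    for ns in $(list_nameservers); do
        echo "== asking $ns =="
        nslookup "$host_to_resolve" "$ns"
    done
}
```

Running `check_nameservers hortonassets.s3.amazonaws.com` on both the Sandbox and the host computer should make it clear which server is failing to resolve the name.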

Then I tried to wget again and...

Success!

Good, because it took a while to think up all of that (and more). At one point I even tried downloading the zip file onto my laptop and copying it over to the VM with

scp ~/Downloads/lahman591-csv.zip root@127.0.0.1:~

but I got an error

ssh: connect to host 127.0.0.1 port 22: Connection refused

I didn't fully work out that problem. (Though I did see that the Sandbox's Network settings have a Port Forwarding button, which shows that the ssh Host Port is 2222 -- the port I used to ssh into the Sandbox -- while the Guest Port is 22. There's a resolution in there somewhere. But I digress...)
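The likely resolution: scp connects to port 22 on the host by default, but only host port 2222 is forwarded to the guest's port 22. scp takes the port with an uppercase -P (lowercase -p, which ssh uses for the port, means "preserve timestamps" in scp). A minimal sketch, untested against this Sandbox; `copy_to_sandbox` is a hypothetical helper name:

```shell
# Copy a local file into root's home directory on the Sandbox,
# going through the forwarded host port (2222) instead of scp's default 22.
copy_to_sandbox() {
    scp -P 2222 "$1" 'root@127.0.0.1:~'
}
```

With that defined, `copy_to_sandbox ~/Downloads/lahman591-csv.zip` should do what the failed scp command above was meant to do.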

So now I need to return to the first step in the Faster Pig with Tez tutorial...
