As a data scientist, I often have to prepare data sets before I can start an analysis. The process of getting information into a useful format is called extract, transform, load (ETL). It's not the only way data is handled, though. Another useful (and similar) process is ELT, where the transformation happens after the data is loaded.
Here's a quick example of how I connected to one MySQL database, performed a query, processed the results and then put the new information back into a different MySQL table.
The example is pretty vague because I used this for a work project and I've removed/changed all the sensitive details. It should still be useful as a starting point for a similar project of your own.
import mysql.connector
import mysql.connector.errors

# connect with a cursor to read the input table
cnxInput = mysql.connector.connect(user='root', password='',
                                   host='ip.ad.dre.ss1', database='mysql')
cursorInput = cnxInput.cursor()
queryInput = ("SELECT "  # put columns here
              "FROM "    # tablename
              )
cursorInput.execute(queryInput)

# connect with a cursor to write the output
cnxOutput = mysql.connector.connect(user='root', password='',
                                    host='ip.ad.dre.ss2', database='mysql')
cursorOutput = cnxOutput.cursor()
add_rows = ("INSERT INTO "  # tablename
            # list of column names
            "VALUES (%s,%s)")

interactions = {}
recipient_sender = {}
# process rows in cursorInput (the unpacked names are placeholders
# for the real columns, which I've redacted)
for mail_id, recipient_id, sender_id in cursorInput:
    # see if interactions has this mail_id key
    if mail_id not in interactions:
        # initialize this mail_id
        # see if this recipient_id has been added before
        if recipient_id in recipient_sender:
            # see if this sender_id has been added before
            if sender_id in recipient_sender[recipient_id]:
                # count new emails from this sender
                if (...):
                    pass  # stuff
                else:
                    pass  # other stuff
        else:
            pass  # initialize new recipient/sender pair

# loop over interactions and output to cursorOutput
for mail_id in interactions:
    try:
        # write to cursorOutput
        row_data = (
            # organize data to be stored
        )
        cursorOutput.execute(add_rows, row_data)
    except mysql.connector.errors.IntegrityError:
        print("duplicate key", mail_id)

# mysql.connector does not autocommit by default
cnxOutput.commit()
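The nested-dictionary counting that the redacted loop performs can be sketched on its own. Here's a minimal, runnable illustration; the rows and field names are invented stand-ins for the real columns:

```python
from collections import defaultdict

# rows as they might come back from cursorInput:
# (mail_id, recipient_id, sender_id)
rows = [
    (1, 'alice', 'bob'),
    (2, 'alice', 'bob'),
    (3, 'alice', 'carol'),
    (4, 'dave', 'bob'),
]

# recipient -> sender -> count of emails, built without any
# explicit has-key checks thanks to defaultdict
recipient_sender = defaultdict(lambda: defaultdict(int))
for mail_id, recipient_id, sender_id in rows:
    recipient_sender[recipient_id][sender_id] += 1

print(dict(recipient_sender['alice']))  # {'bob': 2, 'carol': 1}
```

Using `defaultdict` removes the "has this key been added before?" branching entirely, which is why the real loop above could be quite a bit shorter.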
Friday, March 6, 2015
Thursday, March 5, 2015
ipython notebook and nbconvert
I just found myself in a pinch. I needed to get one of my IPython notebooks to someone quickly, and I heard that there was a way to convert the .ipynb into something else (like a .py).
So I googled it: convert ipython notebook
What I found was this link for nbconvert. It's so sweet! It converts ipython notebooks into several output formats such as html, latex, slides, markdown, rst and python. It's as simple as
ipython nbconvert notebook.ipynb
which converts notebook.ipynb to notebook.html. Or you can do
ipython nbconvert --to python notebook.ipynb
to convert the file to notebook.py.
Super cool!
Wednesday, March 4, 2015
Matrix Factorization Techniques for Recommender Systems
"As the Netflix Prize competition has demonstrated, matrix factorization models are superior to classic nearest-neighbor techniques for producing product recommendations, allowing the incorporation of additional information such as implicit feedback, temporal effects, and confidence levels."
So begins the paper written by Yehuda Koren (Yahoo Research), Robert Bell and Chris Volinsky (AT&T Labs--Research) and published by the IEEE Computer Society.
Companies like Netflix and Amazon use what they know about their subscribers/visitors to optimize recommendations and create more value. This paper discusses recommender systems and the main approaches behind them. That's where I'll start for this post.
Recommendation systems are based on two main strategies, content filtering and collaborative filtering.
Content Filtering
Content filtering focuses on creating a profile for a user or product to describe his/her/its nature. Movies are described by genre, rating, actors, box office success, etc. People might be characterized by their demographics or answers to survey questions.
One drawback here is that the data can be hard to find or collect. For example, movie-watchers see only a small fraction of all movies, and even when they do watch one, they may not rate it.
Collaborative Filtering
Collaborative filtering relies on past user behavior, such as transactions and product ratings. One positive aspect here is that no explicit profile needs to be created for users. This approach analyzes the connections between users and the similarities among products in order to make recommendations.
A drawback is that collaborative filtering needs to address what to do with a new product or user. It takes some time to collect data to understand it/him/her better. This is called the "cold start" problem.
There are two main areas of collaborative filtering, neighborhood methods and latent factor methods.
Neighborhood Methods
Neighborhood methods look at the relationships between items (or between users). A user expresses preferences for items, and an item's neighbors are those that tend to receive similar ratings from the same users.
An example in the paper is of the movie Saving Private Ryan, whose neighbors might be war movies, Steven Spielberg movies or Tom Hanks movies, etc. To predict a specific user's preference for the movie, one needs to look at the neighboring movies that this user rated.
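To make the neighborhood idea concrete, here's a tiny sketch of item-item similarity using cosine similarity over user ratings. The titles and ratings are invented for illustration, and real systems use more refined similarity measures than this:

```python
import math

# toy user -> {movie: rating} data, made up for illustration
ratings = {
    'user1': {'Saving Private Ryan': 5, 'Band of Brothers': 5, 'Toy Story': 2},
    'user2': {'Saving Private Ryan': 4, 'Band of Brothers': 5, 'Toy Story': 1},
    'user3': {'Saving Private Ryan': 1, 'Band of Brothers': 2, 'Toy Story': 5},
}

def item_vector(item):
    # one item's ratings in a fixed user order (0 if unrated)
    return [ratings[u].get(item, 0) for u in sorted(ratings)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

sim_war = cosine(item_vector('Saving Private Ryan'), item_vector('Band of Brothers'))
sim_kids = cosine(item_vector('Saving Private Ryan'), item_vector('Toy Story'))
# the two war movies get similar ratings from the same users,
# so they come out as closer neighbors than the kids' movie
print(sim_war > sim_kids)
```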
Latent Factor Methods
This method tries to characterize the items and the users by learning a number of "latent" (hidden) factors. The number of latent factors can be small or large. Choosing a number that is too small may hurt the prediction engine because it may not learn enough detail about items and users. On the other hand, choosing a number that is too high will not provide additional helpful information, and the run time will be longer.
The idea of latent variables is interesting to think about and understand. By their very nature (being hidden), we will not know what they are; they are just the way the learning system represents the features it encounters.
This may be a little clearer with some examples. If we could know what the latent variables were, we might see some recognizable ones like genre or movie duration. We might also see some that are less recognizable or measurable such as "depth of character" or "quirkiness". Still, other latent factors might not be recognizable at all. These features are all treated as orthogonal dimensions and are represented numerically.
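Here's a bare-bones sketch of learning latent factors by stochastic gradient descent. This is my own toy illustration, not the paper's exact algorithm; the ratings, factor count, learning rate, and regularization values are all arbitrary:

```python
import random

random.seed(0)
# toy observed ratings: (user, item, rating), with 0-indexed ids
data = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
        (1, 2, 1.0), (2, 1, 4.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2  # k = number of latent factors

# user-factor matrix P and item-factor matrix Q, small random init
P = [[random.uniform(0, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.uniform(0, 0.1) for _ in range(k)] for _ in range(n_items)]

lr, reg = 0.01, 0.02
for epoch in range(5000):
    for u, i, r in data:
        # predicted rating is the dot product of the two factor vectors
        pred = sum(P[u][f] * Q[i][f] for f in range(k))
        err = r - pred
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)

# reconstruction error on the observed ratings should now be small
rmse = (sum((r - sum(P[u][f] * Q[i][f] for f in range(k))) ** 2
            for u, i, r in data) / len(data)) ** 0.5
print(round(rmse, 2))
```

The learned columns of P and Q are the latent factors: nothing in the data labels them as "genre" or "quirkiness", they are simply whatever directions best explain the observed ratings.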
Very interesting.
Monday, March 2, 2015
Importing a table into MySQL on a VM
Here are a few steps that I've had to do recently to get a .sql table into a new (remote) VM. Since I've had to do it a few times and I didn't remember all the details, I'm going to write it all out. Hopefully this will help solidify these commands in the ol' thinker.
First, copy the file from the current directory to the remote VM using scp. The public/private keys have already been put in place.
scp -i path/privatekey filename user@ip:/directory
The transfer was pretty quick. Now ssh into the VM by doing the command
ssh -i path/privatekey user@ip
Now navigate to the directory of the file just scp'd. The first time through these steps, the database needs to be created (skip this when importing additional .sql tables into an existing database). From the mysql client, that's
CREATE DATABASE databasename;
And, finally, tell MySQL to import the desired .sql table with this command (a bare -p prompts for the password; if you put it on the command line, it must follow -p with no space)
mysql -u user -p databasename < tablename.sql
I view the progress of the import by opening Sequel Pro (Mac), connecting to the remote VM and selecting the databasename and tablename. Even while the table is loading, you can view the size of the table and how many rows have been loaded. Just click "Refresh table list" (the circular arrow button in the lower-left corner of Sequel Pro) to check the progress because the stats on-screen will not auto refresh unless you select a different table and then re-select the table of interest.
Monday, February 23, 2015
Faster Pig with Tez (part 2)
After making some changes to VirtualBox, which allow my Sandbox to download files from the Internet, I am now returning to the Hortonworks Faster Pig with Tez tutorial. I have just downloaded the sample lahman591-csv.zip file using
wget http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip
It worked! So then I unzipped it by doing
unzip lahman591-csv.zip
and then changed to the lahman591-csv directory and verified that the files were properly downloaded.
The next step is to load Batting.csv into hadoop, by using the command
hadoop fs -put ./Batting.csv /user/guest/
That took several seconds. That's way longer than I thought it should take and at first I thought something was wrong. The file is only 95k lines (using wc Batting.csv). It's probably due to the resources of my 8 GB machine running the 4 GB Sandbox virtual machine. I'll keep that in mind and see if my observations remain consistent with that line of thought.
Just to do a sanity check on how long these commands might take, I did a quick
hadoop fs -ls /user/guest/
and it also took several seconds. So then I did
time hadoop fs -ls /user/guest
to track the time exactly and it took 5.24 sec! Ok, so that's sort of a benchmark for me to work from. Let's move on.
I'm about to run my first Pig script ever. The tutorial tells me to open a file called 1.pig and paste in
batting = LOAD '/user/guest/Batting.csv' USING PigStorage(',');
raw_runs = FILTER batting BY $1>0;
runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
join_max_runs = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_runs GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DUMP join_data;
Already, I like the look of Pig. I'm quite used to SQL (it's approximately my 10th language) -- both its awesome stuff and its quirks. I really like the flow of this Pig script. I'd heard that it reads very linearly, and that's a nice improvement over how opaque SQL can be in some respects.
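To see why the flow feels linear, here is the same max-runs-per-year logic written out step by step in plain Python, with a tiny invented data set standing in for Batting.csv:

```python
# toy rows mirroring Batting.csv: (playerID, year, runs)
batting = [
    ('ruth', 1927, 158),
    ('gehrig', 1927, 149),
    ('bonds', 2001, 129),
    ('suzuki', 2001, 127),
]

# GROUP runs BY year, then MAX: best run total per year
max_runs = {}
for player, year, runs in batting:
    if runs > max_runs.get(year, 0):
        max_runs[year] = runs

# JOIN back to recover which player posted each year's max
leaders = [(year, player, runs) for player, year, runs in batting
           if runs == max_runs[year]]
print(sorted(leaders))  # [(1927, 'ruth', 158), (2001, 'bonds', 129)]
```

Each statement produces one named intermediate result that feeds the next, which is essentially how the Pig script reads too.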
Then it says to run the Pig script using the MapReduce engine by typing
pig 1.pig
I ran it and it looked like it was going to take a while so I killed it and ran
time pig 1.pig
It happily (if not slowly) obeyed. The time to run was 1m 34.9s (real) and 0m 19.0s (user).
Ah, and the tutorial shows that it took their single-node "pseudocluster" 2m 11s to run. Now I don't feel so bad. My experience in a Hadoop/Hive environment is that it shouldn't take quite this long, but I may have been spoiled on super sweet high-end hardware.
And now we move to the last portion of the tutorial: running Pig on Tez. It took less than half the time on their system, and about half the time on my system as well (0m 45s).
This is without any optimization, so it's exciting that there could be further speed increases still.
Pig on Tez, over and out.
Friday, February 20, 2015
Faster Pig with Tez
I'm starting the Hortonworks tutorial called Faster Pig with Tez.
I am interested in learning how Pig works and also what Tez can do. It was really nice to see a definition for "tez" (it's Hindi for "speed"). I've often wondered about various tool names, why they were given their names and what they mean. Ok, then. Moving on.
So I started up my Sandbox and connected to it from a terminal via
ssh root@127.0.0.1 -p 2222
The tutorial tells me to start with a baseball data set that can be downloaded using the command
wget http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip
However, my Sandbox VM could not resolve the host. That's because I hadn't set up VirtualBox's network adapters properly. I shut down the Sandbox, opened Preferences | Network | Host-only Networks and then clicked on the +Adapter button (right side of menu). That added 'vboxnet0' to the (previously empty) list. I went back to the Sandbox settings and added Adapter 2. I changed the 'Attached to' to Host-only Adapter and selected 'Name' to be 'vboxnet0'.
Then I started up the Sandbox again and tried to run the wget command (above), but it still could not resolve the host. So I checked /etc/resolv.conf in the Sandbox and it showed
nameserver 8.8.8.8
That's the Google DNS, but for some reason it wasn't working for me. I tried to ping 8.8.8.8 from the Sandbox and it worked just fine, but it could not ping www.google.com. What?!
Then I did
nslookup hortonassets.s3.amazonaws.com
on my computer and got an ip address of 54.231.2.49, but the Sandbox could not do
wget 54.231.2.49/pig/lahman591-csv.zip
Weird. So I changed /etc/resolv.conf to look at my local network's DNS (which was what my computer's /etc/resolv.conf was pointing at).
Then I tried to wget again and...
Success!
Good, because that took a while to think up all that stuff (and more). I even tried to download the csv onto my laptop and scp it over to the VM with
scp ~/Downloads/lahman591-csv.zip root@127.0.0.1:~
but I got an error
ssh: connect to host 127.0.0.1 port 22: Connection refused
I didn't fully work out those problems. (Though I did see that the Sandbox's Network settings has a Port Forward button that shows that the ssh Host Port is 2222 -- like how I ssh'd into the Sandbox -- but the Guest Port is 22. There's a resolution in there somewhere. But I digress...)
So now I need to return to the first step in the Faster Pig with Tez tutorial...
Friday, February 13, 2015
Recommendation Engines
I'm doing some light reading about creating/coding a recommendation engine. Here are the papers that I'm starting with:
Large-scale Parallel Collaborative Filtering for the Netflix Prize (2008)
Matrix Factorization Techniques for Recommender Systems (2009)
Recommender Systems (2010)
Hortonworks 2.2 (Sandbox)
I'm going to document some of my journey into the world of Hortonworks. I have worked in a Hadoop/Hive environment before and I am really impressed with new tools that have been added to that Apache stack.
I've just decided to download the Hortonworks Data Platform (HDP) version 2.2 (released Dec 2014) and play around with it. I have been planning on blogging about my experiences as a data scientist so I thought this might be a good time to start a blog and keep track of ideas and progress.
I downloaded the HDP 2.2 virtual environment for VirtualBox (VMWare and HyperV are also available) from
http://hortonworks.com/hdp/downloads/
And here's the install guide.
The image is 4.7 GB, about twice the size of the previous HDP 2.1. I'm not sure why it is so much larger, since they added only four new tools (nowhere near doubling the stack).
After opening the VM, it shows instructions for two ways to interact with the Sandbox: through the browser
http://127.0.0.1:8888
and ssh
ssh root@127.0.0.1 -p 2222
I opened the browser (http://127.0.0.1:8888) and found that the ssh password is hadoop. I successfully logged in through ssh. I also followed the new link in the "Dive right in" section, which took me to the Sandbox landing page. The login is hue and the password is 1111, but I appear to be logged in upon arrival.
Now I'll start looking through the tutorials on the Hadoop for Data Scientists & Analysts page.