Friday, March 6, 2015

ETL: An example with Python/MySQL

As a data scientist, I often have to prepare data sets before I get to start an analysis. The process of getting information into a useful format is called extract, transform, load (ETL). This is not the only way that data is handled/processed. Another useful (and similar) process is ELT, where the transformation happens after the data is loaded.

Here's a quick example of how I connected to one MySQL database, performed a query, processed the results and then put the new information back into a different MySQL table.

The example is pretty vague because I used this for a work project and I've removed/changed all the secret stuff. It should still be useful as a starting point for a similar project of your own.

import mysql.connector
from mysql.connector.errors import IntegrityError

# connect with cursor to get table
cnxInput = mysql.connector.connect(user='root', password='', host='ip.ad.dre.ss1', database='mysql')
cursorInput = cnxInput.cursor()
queryInput = ("select " # put columns here
              "from " # tablename
              )
cursorInput.execute(queryInput)

# connect with cursor to write output
cnxOutput = mysql.connector.connect(user='root', password='', host='ip.ad.dre.ss2', database='mysql')
cursorOutput = cnxOutput.cursor()

add_rows = ("INSERT INTO " # tablename
    # list of column names
    "VALUES (%s,%s)")

interactions = {}
recipient_sender = {}  # tracks (recipient, sender) pairs seen so far

# process rows in cursorInput
for mail_id, recipient_id, sender_id in cursorInput:
  # see if interactions has mail_id key
  if mail_id not in interactions:
    # initialize this mail_id
    pass

  # see if this recipient_id has been added before
  if recipient_id in recipient_sender:
    # see if this sender_id has been added before
    if sender_id in recipient_sender[recipient_id]:
      # count new emails from this sender
      if (...):
        # stuff
        pass
      else:
        # other stuff
        pass
  else:
    # initialize new recipient/sender pair
    pass

# loop over interactions and output to cursorOutput
for mail_id in interactions:
  try:
    # write to cursorOutput
    row_data = (
      # organize data to be stored
      )
    cursorOutput.execute(add_rows, row_data)
  except IntegrityError:
    print "duplicate key", mail_id

# commit the inserts and close both connections
cnxOutput.commit()
cursorInput.close()
cursorOutput.close()
cnxInput.close()
cnxOutput.close()
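Since the skeleton above has all the specifics stripped out, it can't run as-is. Here's a self-contained sketch of the same extract-transform-load pattern using Python's built-in sqlite3 module in place of MySQL; the table names, column names and data are all made up for illustration:

```python
import sqlite3

# extract: read rows from a source table
src = sqlite3.connect(":memory:")
src.execute("create table emails (mail_id int, recipient_id int, sender_id int)")
src.executemany("insert into emails values (?,?,?)",
                [(1, 10, 20), (2, 10, 20), (3, 11, 20)])

# transform: count emails for each (recipient, sender) pair
counts = {}
for mail_id, recipient_id, sender_id in src.execute("select * from emails"):
    key = (recipient_id, sender_id)
    counts[key] = counts.get(key, 0) + 1

# load: write the aggregated rows into a destination table
dst = sqlite3.connect(":memory:")
dst.execute("create table interactions (recipient_id int, sender_id int, n int)")
for (recipient_id, sender_id), n in counts.items():
    dst.execute("insert into interactions values (?,?,?)", (recipient_id, sender_id, n))
dst.commit()

rows = sorted(dst.execute("select * from interactions"))
print(rows)  # [(10, 20, 2), (11, 20, 1)]
```

The shape is the same as the MySQL version: one connection to extract, a dictionary to transform, and a second connection (with parameterized inserts and a commit) to load.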

Thursday, March 5, 2015

ipython notebook and nbconvert

I just found myself in a pinch. I needed to get one of my ipython notebooks to someone quickly and I heard that there was a way to convert the .ipynb into something else (like a .py).

So I googled it: convert ipython notebook

What I found was this link for nbconvert. It's so sweet! It converts ipython notebooks into several output formats such as html, latex, slides, markdown, rst and python. It's as simple as

ipython nbconvert notebook.ipynb

which converts notebook.ipynb to notebook.html. Or you can do

ipython nbconvert --to python notebook.ipynb

to convert the file to notebook.py.

Super cool!

Wednesday, March 4, 2015

Matrix Factorization Techniques for Recommender Systems

"As the Netflix Prize competition has demonstrated, matrix factorization models are superior to classic nearest-neighbor techniques for producing product recommendations, allowing the incorporation of additional information such as implicit feedback, temporal effects, and confidence levels."

So begins the paper written by Yehuda Koren (Yahoo Research), Robert Bell and Chris Volinsky (AT&T Labs--Research) and published by the IEEE Computer Society.

Companies like Netflix and Amazon use what they know about their subscribers/visitors to optimize recommendations and create more value. This paper discusses recommender systems and the main approaches behind them. That's where I'll start for this post.

Recommendation systems are based on two main strategies: content filtering and collaborative filtering.

Content Filtering


Content filtering focuses on creating a profile for a user or product to describe his/her/its nature. Movies are described by genre, rating, actors, box office success, etc. People might be characterized by their demographics or answers to survey questions.

One drawback here is that the data can be hard to find or collect. For example, movie-watchers see only a small fraction of all movies, and even when they do watch a movie, they may not rate it.

Collaborative Filtering


Collaborative filtering relies on past user behavior, such as transactions and product ratings. One positive aspect is that no explicit profile needs to be created for users. This approach analyzes the connections between users and the similarities among products in order to make recommendations.

A drawback is that collaborative filtering needs to address what to do with a new product or user. It takes some time to collect data to understand it/him/her better. This is called the "cold start" problem.

There are two main areas of collaborative filtering: neighborhood methods and latent factor methods.

Neighborhood Methods


Neighborhood methods look at the relationships between items (or between users). A user gives items a preference, and an item's neighbors are those that receive similar ratings from the same users.

An example in the paper is of the movie Saving Private Ryan, whose neighbors might be war movies, Steven Spielberg movies or Tom Hanks movies, etc. To predict a specific user's preference for the movie, one needs to look at the neighboring movies that this user rated.
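As a toy illustration of the neighborhood idea (the ratings and the second and third movies are invented here, not from the paper), items can be compared by the cosine similarity of their user-rating vectors:

```python
import math

# each movie rated by the same four users (hypothetical data)
ratings = {
    "Saving Private Ryan": [5, 4, 5, 2],
    "Band of Brothers":    [5, 5, 4, 1],
    "Mamma Mia":           [1, 2, 1, 5],
}

def cosine(a, b):
    # cosine similarity between two rating vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

war = cosine(ratings["Saving Private Ryan"], ratings["Band of Brothers"])
other = cosine(ratings["Saving Private Ryan"], ratings["Mamma Mia"])
print(war > other)  # the war movies are closer neighbors
```

A user's predicted preference for a movie would then be built from their ratings of its nearest neighbors.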

Latent Factor Methods


This method tries to characterize the items and the users by learning a number of "latent" (hidden) factors. The number of latent factors can be small or large. Choosing a number that is too small may hurt the prediction engine because it may not learn enough detail about items and users. On the other hand, choosing a number that is too high will not provide much additional helpful information, and the run time will be longer.

The idea of latent variables is interesting to think about and understand. By their very nature (being hidden), we will not know what they are. They are just the way the learning system understands the features it encounters.

This may be a little clearer with some examples. If we could know what the latent variables were, we might see some recognizable ones like genre or movie duration. We might also see some that are less recognizable or measurable such as "depth of character" or "quirkiness". Still, other latent factors might not be recognizable at all. These features are all treated as orthogonal dimensions and are represented numerically.
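The latent factor idea can be sketched in a few lines of numpy: factor a (tiny, made-up) ratings matrix R into user factors P and item factors Q so that P·Qᵀ approximates R. This uses plain gradient descent on the full matrix; real systems handle missing entries, regularization and much more, as the paper describes.

```python
import numpy as np

rng = np.random.default_rng(0)
# made-up ratings: 4 users x 3 items
R = np.array([[5., 4., 1.],
              [4., 5., 1.],
              [1., 1., 5.],
              [2., 1., 4.]])

k = 2  # number of latent factors
P = 0.1 * rng.standard_normal((R.shape[0], k))  # user factors
Q = 0.1 * rng.standard_normal((R.shape[1], k))  # item factors

lr = 0.01
start_err = np.sum((R - P @ Q.T) ** 2)
for _ in range(2000):
    E = R - P @ Q.T        # residual between ratings and current model
    P_grad = E @ Q         # gradient step for user factors
    Q_grad = E.T @ P       # gradient step for item factors
    P += lr * P_grad
    Q += lr * Q_grad
end_err = np.sum((R - P @ Q.T) ** 2)
print(end_err < start_err)  # the factorization fits the ratings better
```

After training, a user's predicted rating for an item is just the dot product of that user's factor row and the item's factor row.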

Very interesting.

Monday, March 2, 2015

Importing a table into MySQL on a VM

Here are a few steps that I've had to do recently to get a .sql table into a new (remote) VM. Since I've had to do it a few times and I didn't remember all the details, I'm going to write it all out. Hopefully this will help solidify these commands in the ol' thinker.

First, copy the file from the current directory to the remote VM using scp. The public/private keys have already been put in place.

scp -i path/privatekey filename user@ip:/directory

The transfer was pretty quick. Now ssh into the VM by doing the command

ssh -i path/privatekey user@ip

Now navigate to the directory of the file just scp'd. The first time through these steps, the database needs to be created (skip this when importing additional .sql tables into an existing database):

mysql -u user -p -e "create database databasename"

And, finally, tell MySQL to import the desired .sql table with this command (the -p flag will prompt for the password):

mysql -u user -p databasename < tablename.sql

I view the progress of the import by opening Sequel Pro (Mac), connecting to the remote VM and selecting the databasename and tablename. Even while the table is loading, you can view the size of the table and how many rows have been loaded. Just click "Refresh table list" (the circular arrow button in the lower-left corner of Sequel Pro) to check the progress; the on-screen stats will not refresh automatically unless you select a different table and then re-select the table of interest.