Musings of a Data Scientist: ETL: An example with Python/MySQL

As a data scientist, I often have to prepare data sets before I get to start an analysis. The process of getting information into a useful format is called extract, transform, load (ETL). This is not the only that data is handled/processed. Another useful (and similar) process is ELT, where the transformation happens after it is loaded.

Here's a quick example of how I connected to one MySQL database, performed a query, processed the results and then put the new information back into a different MySQL table.

The example is pretty vague because I used this for a work project and I've removed/changed all the secret stuff. It should still be useful as a starting point for your similar project.

import mysql.connector
import mysql.connector.errors

# connect with cursor to get table
cnxInput = mysql.connector.connect(user='root',password='',host='ip.ad.dre.ss1',database='mysql')
cursorInput = cnxInput.cursor()
queryInput = ("select " # put columns here
"from " # tablename
)
cursorInput.execute(queryInput)

# connect with cursor to write output
cnxOutput = mysql.connector.connect(user='root',password='',host='ip.ad.dre.ss2',database='mysql')
cursorOutput = cnxOutput.cursor()

add_rows = (" INSERT INTO " # tablename
# list of column names
"VALUES (%s,%s) ")

interactions = {}

# process rows incursorInput
for some, stuff in cursorInput:
# see if interactions has mail_id key
if (not interactions.has_key(mail_id)):
# initialize this mail_id

# see if this recipient_id has been added before
if (recipient_sender.has_key(recipient_id)):
# see if this sender_id has been added before
if (recipient_sender[recipient_id].has_key(sender_id)):
# count new emails from this sender
if (...):
# stuff
else:
# other stuff
else:
# initialize new recipient/sender pair

# loop over interactions and output to cursorOutput
for mail_id in interactions:
try:
# write to cursorOutput
row_data = (
# organize data to be stored
)
cursorOutput.execute(add_rows, row_data)
except IntegrityError:
print "duplicate key", mail_id

Musings of a Data Scientist

Friday, March 6, 2015

ETL: An example with Python/MySQL

No comments:

Post a Comment