Friday, March 6, 2015

ETL: An example with Python/MySQL

As a data scientist, I often have to prepare data sets before I get to start an analysis. The process of getting information into a useful format is called extract, transform, load (ETL). This is not the only that data is handled/processed. Another useful (and similar) process is ELT, where the transformation happens after it is loaded.

Here's a quick example of how I connected to one MySQL database, performed a query, processed the results and then put the new information back into a different MySQL table.

The example is pretty vague because I used this for a work project and I've removed/changed all the secret stuff. It should still be useful as a starting point for your similar project.

import mysql.connector
import mysql.connector.errors

# connect with cursor to get table
cnxInput = mysql.connector.connect(user='root',password='',host='ip.ad.dre.ss1',database='mysql')
cursorInput = cnxInput.cursor()
queryInput = ("select " # put columns here
              "from " # tablename
              )
cursorInput.execute(queryInput)

# connect with cursor to write output
cnxOutput = mysql.connector.connect(user='root',password='',host='ip.ad.dre.ss2',database='mysql')
cursorOutput = cnxOutput.cursor()

add_rows = (" INSERT INTO " # tablename
    # list of column names
    "VALUES (%s,%s) ")

interactions = {}

# process rows incursorInput
for some, stuff in cursorInput:
  # see if interactions has mail_id key
  if (not interactions.has_key(mail_id)):
    # initialize this mail_id

  # see if this recipient_id has been added before
  if (recipient_sender.has_key(recipient_id)):
    # see if this sender_id has been added before
    if (recipient_sender[recipient_id].has_key(sender_id)):
      # count new emails from this sender
      if (...):
        # stuff
      else:
        # other stuff
  else:
    # initialize new recipient/sender pair
    
# loop over interactions and output to cursorOutput
for mail_id in interactions:
  try:
    # write to cursorOutput
    row_data = (
      # organize data to be stored
      )
    cursorOutput.execute(add_rows, row_data)
  except IntegrityError:
    print "duplicate key", mail_id

No comments:

Post a Comment