When you start a new innovation project, your data is not always in a structured database. Most of the time you need to import a CSV file for a quick analysis. After that you can manipulate the data and, finally, start data modelling. In one of my recent projects I was working on a proof of concept for TV attribution. TV attribution determines the interaction between TV commercials and online visits, so we needed both TV schedules and online visits at minute level. Our online data is in a database, but the TV schedules were not.
Strange values
The first time I read the TV schedules, I saw data that I didn’t expect to see. Below you can see a small sample of the output. It contains some strange values, like ‘--’ instead of a date and times above 24 hours.
######################
"""import libraries"""
######################
import os
import datetime

import numpy as np
import pandas as pd

##################################
"""read data for the first time"""
##################################
#define file path
path = os.path.dirname(os.path.realpath(__file__))
file_name = 'tv_schedule.csv'
filepath_or_buffer = os.path.join(path, file_name)
#define if a header row is in the csv
header_boolean = True
#define header input: row 0 holds the header, otherwise there is none
if header_boolean:
    header_csv = 0
else:
    header_csv = None
#read data
data = pd.read_csv(filepath_or_buffer=filepath_or_buffer,
                   header=header_csv)
#print data
print('first outcome: \n', data)
first outcome:
date time channel grp
0 -- xx NPO1 2.0
1 -- xx NPO1 1.2
2 30-11-2014 24:23:00 NPO2 1.8
3 30-12-2014 xx TV Gelderl 0.1
4 30-12-2014 xx TV Nh 0.1
5 30-12-2014 19:23 RTL4 0.4
6 30-12-2014 24:46:00 TV West 0.1
7 30-12-2014 xx O.Brabant 0.1
8 -- xx Rijnmond 0.0
9 31-12-2014 19:54 SBS6 0.2
10 31-12-2014 20:21 NPO1 11.1
11 31-12-2014 xx TV Gelderl 0.1
12 31-12-2014 xx TV Oost 0.2
13 31-12-2014 xx TV Zeeland 0.1
14 31-12-2014 25:24:00 NPO1 0.8
[15 rows x 4 columns]
To adjust the data we need to add some intelligence. First, we have to know which rows are ‘bad’, so we can skip them. Second, we have to parse the datetime values into a proper datetime format. So, let’s give it a try!
Find bad rows
The first challenge is to determine which rows you don’t want to import from the CSV file. Pandas.read_csv lets you specify which rows to skip; we will use that later on. One of the issues we can see is ‘--’ in the date column and ‘xx’ in the time column. Therefore we should indicate in which column each issue occurs.
###################
"""define issues"""
###################
issues = {
    "issue_0": {
        "column": 0,  #integer if you only know the position of the column (imagine you have no headers)
        "indicator_of_bad_row": '--'},
    "issue_1": {
        "column": 'time',  #string if you know the name of the column
        "indicator_of_bad_row": 'xx'}
}
Now that we know the issues, we have to determine the ‘bad’ rows. Below we define a function that returns a NumPy array containing the indices of the ‘bad’ rows. For this we take three steps: 1) import the data; 2) find the ‘bad’ rows per issue; and 3) find the unique ‘bad’ rows over all issues.
##################
"""get bad rows"""
##################
def get_bad_rows(filepath_or_buffer, issues, header_csv):
    """return an array with the indices of the incorrect rows"""
    ###Import data
    #read data
    data = pd.read_csv(filepath_or_buffer=filepath_or_buffer,
                       header=header_csv)
    ###Find bad rows per issue
    #define column names
    column_names = data.columns
    #initialize empty array of bad rows
    bad_rows_total_array = np.array([])
    #find bad rows per issue
    for i in issues:
        issue = issues[i]
        #determine if the column of the issue is referenced by name (string)
        string_boolean = isinstance(issue['column'], str)
        #select the column by name if it is a string
        if string_boolean:
            column_name = issue['column']
        #otherwise select the column by its integer position
        else:
            column_name = column_names[issue['column']]
        #select subset of 'bad' rows
        data_bad_rows = data[data[column_name] == issue['indicator_of_bad_row']]
        #get bad rows per issue
        bad_rows_issue_array = data_bad_rows.index.values
        #add bad rows to the total array of bad rows
        bad_rows_total_array = np.concatenate(
            (bad_rows_total_array, bad_rows_issue_array)).astype(int)
    ###Find unique bad rows over all issues
    #get unique bad rows
    bad_rows_total_array = np.unique(bad_rows_total_array)
    #if the csv has a header, shift the indices by +1 because skiprows in
    #pd.read_csv counts physical file lines, including the header line
    if header_csv is not None:
        bad_rows_total_array += 1
    return bad_rows_total_array

#get bad rows
bad_rows_array = get_bad_rows(filepath_or_buffer=filepath_or_buffer,
                              issues=issues,
                              header_csv=header_csv)
#print bad rows array
print('bad rows: \n', bad_rows_array)
bad rows:
[ 1 2 4 5 8 9 12 13 14]
The outcome looks incorrect at first sight, but it is correct. Pandas treats the header as the first physical line (line 0) of the file, while the DataFrame index starts counting at the first data row. Because skiprows refers to physical file lines, we have to shift all indices by one if the CSV file has a header.
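To make that offset concrete, here is a minimal sketch (using an inline CSV via io.StringIO instead of the schedule file, so it runs stand-alone): skipping physical line 1 removes the row that had DataFrame index 0.
import io
import pandas as pd

csv_text = "date,time\n--,xx\n30-12-2014,19:23\n"
#without skiprows: the '--' row gets DataFrame index 0
print(pd.read_csv(io.StringIO(csv_text)))
#skiprows counts physical lines, so line 0 is the header;
#skipping line 1 drops the '--' row while keeping the header
print(pd.read_csv(io.StringIO(csv_text), skiprows=[1]))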
Parse dates
Ok, so we know the ‘bad’ rows in the CSV file. But what about the strange datetime values? For example, we see hours that are higher than 24. From experience I know that 31-12-2014 at 25:24:00 means 1-1-2015 at 01:24:00. In this case we need to shift the year, month, day and hour values to return a correct datetime. If the hour is 24 or higher, we need to shift the hour back into range and move to the next day. We then need to check whether it is the last day of the month; if so, the day value rolls over to the first day of the next month. The same goes for the month and year values: in the last month of the year the month rolls back to 1 and the year shifts by 1, while in all other cases only the month shifts by 1. Instead of writing all of these checks ourselves, datetime.timedelta handles the rollover for us, so one if statement (for the out-of-range hour value) replaces multiple if statements. Below you can see the date parser function.
#################
"""date parser"""
#################
def date_parser(date, time):
    """define a date parser for separate date and time columns"""
    ###get day, month and year as integers
    date_str = str(date)
    day, month, year = map(int, date_str.split("-"))
    ###get hour, minute and second as integers
    time_str = str(time)
    count_time_str = time_str.count(':')
    if count_time_str == 1:
        hour, minute = map(int, time_str.split(":"))
        second = 0
    elif count_time_str == 2:
        hour, minute, second = map(int, time_str.split(":"))
    ###correct hour, day, month and year values in case of an out-of-range hour
    if hour >= 24:
        #build the datetime at hour 23, which is always valid
        datetime_format = datetime.datetime(year, month, day, 23, minute, second)
        #let timedelta roll hour, day, month and year over in one step
        datetime_format = datetime_format + datetime.timedelta(hours=(hour - 23))
    else:
        #get datetime format directly
        datetime_format = datetime.datetime(year, month, day, hour, minute, second)
    #return
    return datetime_format
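As a quick check (assuming the function above is in scope), the parser rolls an out-of-range hour over into the next day, and across a year boundary when needed:
print(date_parser('30-11-2014', '24:23:00'))  #2014-12-01 00:23:00
print(date_parser('31-12-2014', '25:24:00'))  #2015-01-01 01:24:00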
Import final data
If we skip the ‘bad’ rows and use the date parser when we import the data, then our data is ready for data manipulation. Below we can see the result, and it looks perfect!
###################################
"""skip bad rows and parse dates"""
###################################
#define positions of the datetime columns
date_position = 0
time_position = 1
#import data for data manipulation
data = pd.read_csv(
    filepath_or_buffer=filepath_or_buffer,
    header=header_csv,
    encoding='iso-8859-1',
    parse_dates={'datetime': [date_position, time_position]},
    date_parser=date_parser,
    skiprows=bad_rows_array
)
print('final outcome: \n', data)
final outcome:
datetime channel grp
0 2014-12-01 00:23:00 NPO2 1.8
1 2014-12-30 19:23:00 RTL4 0.4
2 2014-12-31 00:46:00 TV West 0.1
3 2014-12-31 19:54:00 SBS6 0.2
4 2014-12-31 20:21:00 NPO1 11.1
5 2015-01-01 01:24:00 NPO1 0.8
[6 rows x 3 columns]
As we can see above, only the ‘good’ rows are imported, and, for example, 31-12-2014 at 25:24:00 is adjusted to 1-1-2015 at 01:24:00.
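To confirm that the parse really produced datetime values rather than strings, a quick sanity check of the column types could look like this (the expected dtypes are shown as comments):
print(data.dtypes)
#datetime    datetime64[ns]
#channel             object
#grp                float64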
Conclusion
From now on it is easy to skip ‘bad’ rows whenever new issues appear, simply by adding them to the issues dictionary. The date parser works perfectly in this case. However, I think it would be a great contribution to Pandas if we wrote the ultimate date parser library: there are so many formats and exceptions in datetime values that we have written dozens of different date parsers at Blue Mango.
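For example, suppose a future export also marks missing ratings with 'n/a' in the grp column (a hypothetical indicator, just to illustrate the pattern); it only takes one extra entry before calling get_bad_rows again:
issues["issue_2"] = {
    "column": 'grp',  #hypothetical issue for illustration
    "indicator_of_bad_row": 'n/a'}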