When you start a new innovation project, your data is not always in a structured database. Most of the time you need to import a CSV file for a quick analysis. After that you can manipulate the data and, finally, start data modelling. In one of my recent projects I was working on a proof of concept for TV attribution. TV attribution determines the interaction between TV commercials and online visits, so we needed both TV schedules and online visits at minute level. Our online data is in a database, but the TV schedules were not.
Strange values
The first time I read the TV schedules, I saw data that I didn’t expect to see. Below you can see a small sample of the output. It contains some strange values, like ‘--’ instead of a date and times above 24 hours.
######################
"""import libraries"""
######################
import os
import datetime

import numpy as np
import pandas as pd

##################################
"""read data for the first time"""
##################################
#define file path
path = os.path.dirname(os.path.realpath(__file__))
file_name = 'tv_schedule.csv'
filepath_or_buffer = os.path.join(path, file_name)
#define if a header row is in the csv
header_boolean = True
#define header input: row 0 holds the header, otherwise there is none
if header_boolean:
    header_csv = 0
else:
    header_csv = None
#read data
data = pd.read_csv(filepath_or_buffer=filepath_or_buffer,
                   header=header_csv)
#print data
print('first outcome: \n', data)
first outcome:
date time channel grp
0 -- xx NPO1 2.0
1 -- xx NPO1 1.2
2 30-11-2014 24:23:00 NPO2 1.8
3 30-12-2014 xx TV Gelderl 0.1
4 30-12-2014 xx TV Nh 0.1
5 30-12-2014 19:23 RTL4 0.4
6 30-12-2014 24:46:00 TV West 0.1
7 30-12-2014 xx O.Brabant 0.1
8 -- xx Rijnmond 0.0
9 31-12-2014 19:54 SBS6 0.2
10 31-12-2014 20:21 NPO1 11.1
11 31-12-2014 xx TV Gelderl 0.1
12 31-12-2014 xx TV Oost 0.2
13 31-12-2014 xx TV Zeeland 0.1
14 31-12-2014 25:24:00 NPO1 0.8
[15 rows x 4 columns]
To adjust the data we need to add some intelligence. First, we have to know which rows are ‘bad’, so we can skip them. Second, we have to parse the datetime values into a proper datetime format. So, let’s give it a try!
Find bad rows
The first challenge is to determine which rows you don’t want to import from the CSV file. Pandas.read_csv lets you specify which rows to skip; we will use that later on. One of the issues we can see is ‘--’ in the date column and ‘xx’ in the time column. Therefore we should indicate in which column each issue occurs.
###################
"""define issues"""
###################
issues = {
    "issue_0": {
        "column": 0,  #integer if you only know the position of the column (imagine you have no headers)
        "indicator_of_bad_row": '--'},
    "issue_1": {
        "column": 'time',  #string if you know the name of the column
        "indicator_of_bad_row": 'xx'}
}
Now that we know the issues, we have to determine the ‘bad’ rows. Below we define a function that returns a NumPy array containing the indices of the ‘bad’ rows. For this we take three steps: 1) import the data; 2) find the ‘bad’ rows per issue; and 3) find the unique ‘bad’ rows over all issues.
##################
"""get bad rows"""
##################
def get_bad_rows(filepath_or_buffer, issues, header_csv):
    """return an array with the indices of the incorrect rows"""
    ###Import data
    #read data
    data = pd.read_csv(filepath_or_buffer=filepath_or_buffer,
                       header=header_csv)
    ###Find bad rows per issue
    #define column names
    column_names = data.columns
    #initialize empty array of bad rows
    bad_rows_total_array = np.array([])
    #find bad rows per issue
    for i in issues:
        issue = issues[i]
        #determine if the column of the issue is referenced by name (string)
        string_boolean = isinstance(issue['column'], str)
        #select the column by name if it is a string
        if string_boolean:
            column_name = issue['column']
        #otherwise select the column by its integer position
        else:
            column_name = column_names[issue['column']]
        #select subset of 'bad' rows
        data_bad_rows = data[data[column_name] == issue['indicator_of_bad_row']]
        #get bad rows per issue
        bad_rows_issue_array = data_bad_rows.index.values
        #add bad rows to the total array of bad rows
        bad_rows_total_array = np.concatenate(
            (bad_rows_total_array, bad_rows_issue_array)).astype(int)
    ###Find unique bad rows over all issues
    #get unique bad rows
    bad_rows_total_array = np.unique(bad_rows_total_array)
    #if the csv has a header, shift the indices by +1 because skiprows in
    #pd.read_csv counts physical file lines, including the header line
    if header_csv is not None:
        bad_rows_total_array += 1
    return bad_rows_total_array

#get bad rows
bad_rows_array = get_bad_rows(filepath_or_buffer=filepath_or_buffer,
                              issues=issues,
                              header_csv=header_csv)
#print bad rows array
print('bad rows: \n', bad_rows_array)
bad rows:
[ 1 2 4 5 8 9 12 13 14]
The outcome looks incorrect at first sight, but it is correct. Pandas treats the header as the first physical line (line 0) of the file, while the DataFrame index starts counting at the first data row. Because skiprows refers to physical file lines, we have to shift all indices by one if the CSV file has a header.
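To make that offset concrete, here is a minimal sketch (using an inline CSV via io.StringIO instead of the schedule file, so it runs stand-alone): skipping physical line 1 removes the row that had DataFrame index 0.
import io
import pandas as pd

csv_text = "date,time\n--,xx\n30-12-2014,19:23\n"
#without skiprows: the '--' row gets DataFrame index 0
print(pd.read_csv(io.StringIO(csv_text)))
#skiprows counts physical lines, so line 0 is the header;
#skipping line 1 drops the '--' row while keeping the header
print(pd.read_csv(io.StringIO(csv_text), skiprows=[1]))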
Parse dates
Ok, so we know the ‘bad’ rows in the CSV file. But what about the strange datetime values? For example, we see hours that are higher than 24. From experience I know that 31-12-2014 at 25:24:00 means 1-1-2015 at 01:24:00. In this case we need to shift the year, month, day and hour values to return a correct datetime. If the hour is 24 or higher, we need to shift the hour back into range and move to the next day. We then need to check whether it is the last day of the month; if so, the day value rolls over to the first day of the next month. The same goes for the month and year values: in the last month of the year the month rolls back to 1 and the year shifts by 1, while in all other cases only the month shifts by 1. Instead of writing all of these checks ourselves, datetime.timedelta handles the rollover for us, so one if statement (for the out-of-range hour value) replaces multiple if statements. Below you can see the date parser function.
#################
"""date parser"""
#################
def date_parser(date, time):
    """define a date parser for separate date and time columns"""
    ###get day, month and year as integers
    date_str = str(date)
    day, month, year = map(int, date_str.split("-"))
    ###get hour, minute and second as integers
    time_str = str(time)
    count_time_str = time_str.count(':')
    if count_time_str == 1:
        hour, minute = map(int, time_str.split(":"))
        second = 0
    elif count_time_str == 2:
        hour, minute, second = map(int, time_str.split(":"))
    ###correct hour, day, month and year values in case of an out-of-range hour
    if hour >= 24:
        #build the datetime at hour 23, which is always valid
        datetime_format = datetime.datetime(year, month, day, 23, minute, second)
        #let timedelta roll hour, day, month and year over in one step
        datetime_format = datetime_format + datetime.timedelta(hours=(hour - 23))
    else:
        #get datetime format directly
        datetime_format = datetime.datetime(year, month, day, hour, minute, second)
    #return
    return datetime_format
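As a quick check (assuming the function above is in scope), the parser rolls an out-of-range hour over into the next day, and across a year boundary when needed:
print(date_parser('30-11-2014', '24:23:00'))  #2014-12-01 00:23:00
print(date_parser('31-12-2014', '25:24:00'))  #2015-01-01 01:24:00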
Import final data
If we skip the ‘bad’ rows and use the date parser when we import the data, then our data is ready for data manipulation. Below we can see the result, and it looks perfect!
###################################
"""skip bad rows and parse dates"""
###################################
#define positions of the datetime columns
date_position = 0
time_position = 1
#import data for data manipulation
data = pd.read_csv(
    filepath_or_buffer=filepath_or_buffer,
    header=header_csv,
    encoding='iso-8859-1',
    parse_dates={'datetime': [date_position, time_position]},
    date_parser=date_parser,
    skiprows=bad_rows_array
)
print('final outcome: \n', data)
final outcome:
datetime channel grp
0 2014-12-01 00:23:00 NPO2 1.8
1 2014-12-30 19:23:00 RTL4 0.4
2 2014-12-31 00:46:00 TV West 0.1
3 2014-12-31 19:54:00 SBS6 0.2
4 2014-12-31 20:21:00 NPO1 11.1
5 2015-01-01 01:24:00 NPO1 0.8
[6 rows x 3 columns]
As we can see above, only the ‘good’ rows are imported, and, for example, 31-12-2014 at 25:24:00 is adjusted to 1-1-2015 at 01:24:00.
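To confirm that the parse really produced datetime values rather than strings, a quick sanity check of the column types could look like this (the expected dtypes are shown as comments):
print(data.dtypes)
#datetime    datetime64[ns]
#channel             object
#grp                float64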
Conclusion
From now on it is easy to skip ‘bad’ rows whenever new issues appear, simply by adding them to the issues dictionary. The date parser works perfectly in this case. However, I think it would be a great contribution to Pandas if we wrote the ultimate date parser library: there are so many formats and exceptions in datetime values that we have written dozens of different date parsers at Blue Mango.
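For example, suppose a future export also marks missing ratings with 'n/a' in the grp column (a hypothetical indicator, just to illustrate the pattern); it only takes one extra entry before calling get_bad_rows again:
issues["issue_2"] = {
    "column": 'grp',  #hypothetical issue for illustration
    "indicator_of_bad_row": 'n/a'}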