As data scientists in the world of advertising, we often need to prove the direct and delayed effects of advertising. A phenomenon we have seen in recent years is that consumers use multiple screens while they watch television. This is called screen stacking: after a television commercial they can visit the website of the brand or product if it caught their interest.
A couple of years ago this was modelled with data on week or day level. Nowadays we model on minute level at Blue Mango. We find direct interaction effects between the commercial and website visits within the same minute, and in the twenty minutes after the commercial we still see an uplift in visits. To model the direct and delayed effect, we make use of ad stock variables, so we first have to calculate ad stock values. On the one hand the code should be easy to read; on the other hand we hate losing time on data collection, because we love models.
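In formula form, the ad stock in a given minute is the GRPs aired in that minute plus the decayed ad stock of the previous minute, with the engagement factor as the decay rate; this is exactly the recurrence the code further down implements:

ad_stock[t] = grp[t] + engagement_factor * ad_stock[t - 1]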
Creating a dataset
Today we only talk about data collection. Later on I will post a blog about hierarchical Bayesian models to determine the attributed value of TV commercials. One of the awesome packages in Python for data collection is Pandas. Yes, I am that guy who is addicted to Pandas. In the example below, we use Pandas to create a dataset of one year of data on minute level. The length of the dataset is 525,600 rows (365 days * 24 hours * 60 minutes). For this example we add one column with GRPs of commercials on the TV channel RTL 4. GRPs are the percentage of the target group that watched the commercial in that specific minute. For now we only consider one commercial for the entire year.
#######################
"""import libraries."""
#######################
import datetime
import numpy as np
import pandas as pd
#####################
"""create dataset."""
#####################
def create_dataset(start_datetime=datetime.datetime(year=2014, month=1, day=1),
                   end_datetime=datetime.datetime(year=2015, month=1, day=1)):
    """Create 1 year of data on minute level with 1 column with GRPs on RTL 4."""
    #create datetime values between start_datetime and end_datetime on minute level
    time_index = np.arange(start_datetime, end_datetime, dtype='datetime64[m]')
    #create dataframe including time_index
    data = pd.DataFrame(index=time_index)
    #define index name
    data.index.name = 'datetime'
    #add column with zeros for GRPs on RTL 4
    data['grp_on_rtl4'] = float(0)
    #add ten GRPs in the third minute of the first of January, the only commercial of the year
    data.loc[datetime.datetime(year=2014, month=1, day=1, hour=0, minute=2),
             'grp_on_rtl4'] = 10
    return data

#create dataset and print first and last rows of data
data = create_dataset()
print '\n First five rows of data: \n', data[:5]
print '\n Last five rows of data: \n', data[-5:]
First five rows of data:
grp_on_rtl4
datetime
2014-01-01 00:00:00 0
2014-01-01 00:01:00 0
2014-01-01 00:02:00 10
2014-01-01 00:03:00 0
2014-01-01 00:04:00 0
[5 rows x 1 columns]
Last five rows of data:
grp_on_rtl4
datetime
2014-12-31 23:55:00 0
2014-12-31 23:56:00 0
2014-12-31 23:57:00 0
2014-12-31 23:58:00 0
2014-12-31 23:59:00 0
[5 rows x 1 columns]
Calculate ad stocks: the slow way
To calculate ad stock values we make use of an engagement factor of 0.9. So one minute after a commercial 90% of the consumers are still engaged by the commercial, two minutes after the commercial 81% (0.9 * 0.9) are still engaged, and so forth. Below we see a simple for loop that writes the ad stock values directly into the dataset. The disadvantage is that every iteration has to perform an index operation. Index operations are great for quickly setting up a small program and for data manipulation, but again, I hate losing time on data collection! In this case the loop takes more than one minute, and that is just for one variable.
########################################################
"""test speed of non-vectorized ad stock calculations"""
########################################################
def calculate_ad_stocks_non_vectorized(data, engagement_factor=0.9):
    """Calculate ad stocks in a non-vectorized and slow way."""
    #initialize ad stock column
    data['grp_on_rtl4_ad_stock'] = float(0)
    #initialize ad_stock_value
    ad_stock_value = 0
    #loop through dataset to calculate ad stock values
    for index, row in data.iterrows():
        ad_stock_value = row['grp_on_rtl4'] + ad_stock_value * engagement_factor
        data.loc[index, 'grp_on_rtl4_ad_stock'] = ad_stock_value
    return data

#measure time spent in the non-vectorized and slow way
time_start = datetime.datetime.now()
data_ad_stocks = calculate_ad_stocks_non_vectorized(data=data,
                                                    engagement_factor=0.9)
time_end = datetime.datetime.now()
time_spend_non_vectorized = time_end - time_start
print '\n Time spent for ad stock transformation in non-vectorized and slow way: \n', time_spend_non_vectorized
print '\n First ten rows of data including ad stock values: \n', data_ad_stocks[:10]
Time spent for ad stock transformation in non-vectorized and slow way:
0:01:08.984000
First ten rows of data including ad stock values:
grp_on_rtl4 grp_on_rtl4_ad_stock
datetime
2014-01-01 00:00:00 0 0.000000
2014-01-01 00:01:00 0 0.000000
2014-01-01 00:02:00 10 10.000000
2014-01-01 00:03:00 0 9.000000
2014-01-01 00:04:00 0 8.100000
2014-01-01 00:05:00 0 7.290000
2014-01-01 00:06:00 0 6.561000
2014-01-01 00:07:00 0 5.904900
2014-01-01 00:08:00 0 5.314410
2014-01-01 00:09:00 0 4.782969
[10 rows x 2 columns]
Calculate ad stocks: the fast way
So if we hate losing time, why not make use of vector operations, which are way faster than index operations? The first change below is that we initialize a vector (a plain Python list) for the ad stock values instead of adding a new column to the dataset. In the for loop we append the ad stock values to this vector instead of writing them directly into the dataset. Finally, after the for loop, we add the vector to the dataset as a new column. This way we win almost a free minute, awesome!
####################################################
"""test speed of vectorized ad stock calculations"""
####################################################
def calculate_ad_stocks_vectorized(data, engagement_factor):
    """Calculate ad stocks in a vectorized and fast way."""
    #initialize ad stock vector
    grp_on_rtl4_ad_stock_vector = []
    #initialize ad_stock_value
    ad_stock_value = 0
    #loop through dataset to calculate ad stock values
    for index, row in data.iterrows():
        ad_stock_value = row['grp_on_rtl4'] + ad_stock_value * engagement_factor
        grp_on_rtl4_ad_stock_vector.append(ad_stock_value)
    #add ad stock vector to dataset
    data['grp_on_rtl4_ad_stock'] = grp_on_rtl4_ad_stock_vector
    return data

#measure time spent in the vectorized and fast way
time_start = datetime.datetime.now()
data_ad_stocks = calculate_ad_stocks_vectorized(data=data,
                                                engagement_factor=0.9)
time_end = datetime.datetime.now()
time_spend_vectorized = time_end - time_start
print '\n Time spent for ad stock transformation in vectorized and fast way: \n', time_spend_vectorized
print '\n First ten rows of data including ad stock values: \n', data_ad_stocks[:10]
#calculate speed up factor
speed_up_factor = time_spend_non_vectorized.total_seconds() / time_spend_vectorized.total_seconds()
print '\n Speed up factor: \n', speed_up_factor
Time spent for ad stock transformation in vectorized and fast way:
0:00:15.557000
First ten rows of data including ad stock values:
grp_on_rtl4 grp_on_rtl4_ad_stock
datetime
2014-01-01 00:00:00 0 0.000000
2014-01-01 00:01:00 0 0.000000
2014-01-01 00:02:00 10 10.000000
2014-01-01 00:03:00 0 9.000000
2014-01-01 00:04:00 0 8.100000
2014-01-01 00:05:00 0 7.290000
2014-01-01 00:06:00 0 6.561000
2014-01-01 00:07:00 0 5.904900
2014-01-01 00:08:00 0 5.314410
2014-01-01 00:09:00 0 4.782969
[10 rows x 2 columns]
Speed up factor:
4.43427396028
As we have seen above, you can speed up this data manipulation in Pandas by a factor of roughly 4.4, simply by replacing the per-row index assignments with a single vector assignment after the loop. In case of multiple variables I use a matrix manipulation instead of a vector manipulation per variable; this way you can win approximately ten minutes if you have ten variables.
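The matrix variant is not spelled out in this post, but a minimal sketch of what I mean could look like the function below. It reuses the numpy import from the top of the post, and the grp_columns argument is a hypothetical list holding the names of all GRP columns in the dataset.
#############################################################
"""sketch: ad stocks for multiple GRP columns in one pass."""
#############################################################
def calculate_ad_stocks_matrix(data, grp_columns, engagement_factor=0.9):
    """Calculate ad stocks for several GRP columns at once (illustrative sketch)."""
    #collect all GRP columns in one numpy matrix: rows are minutes, columns are channels
    grp_matrix = data[grp_columns].values.astype(float)
    #pre-allocate a matrix for the ad stock values
    ad_stock_matrix = np.zeros_like(grp_matrix)
    #previous ad stock values per column, starting at zero
    previous_values = np.zeros(grp_matrix.shape[1])
    #one pass over the rows updates every column at the same time
    for row_number in range(grp_matrix.shape[0]):
        previous_values = grp_matrix[row_number] + previous_values * engagement_factor
        ad_stock_matrix[row_number] = previous_values
    #write all ad stock columns back to the dataset in one go
    for column_number, column_name in enumerate(grp_columns):
        data[column_name + '_ad_stock'] = ad_stock_matrix[:, column_number]
    return data
For the single GRP column used in this post, calculate_ad_stocks_matrix(data, ['grp_on_rtl4']) gives the same ad stock values as the vectorized function above.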
Next time I will blog about what you can do with this data with the help of hierarchical Bayesian models, so we can determine the attributed value per TV commercial. Below you can already see a sneak preview of what we see in the actual data.
To be continued...