As data scientists in the world of advertising, we often need to prove the direct and delayed effects of advertising. A phenomenon we have seen in recent years is that consumers use multiple screens while they watch television. This is called screen stacking: after a television commercial they can visit the website of the brand or product if it caught their interest.
A couple of years ago this was modelled with data on week or day level. Nowadays we model on minute level at Blue Mango. We find direct interaction effects between the commercial and website visits within the same minute, and in the twenty minutes after the commercial we still see an uplift in visits. To model the direct and delayed effect, we make use of ad stock variables, so we first have to calculate ad stock values. On the one hand the code should be easy to read; on the other hand we hate losing time on data collection, because we love models.
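In formula form, the ad stock in a given minute is the GRPs aired in that minute plus the decayed ad stock of the previous minute, with the engagement factor as the decay rate; this is exactly the recurrence the code further down implements:

ad_stock[t] = grp[t] + engagement_factor * ad_stock[t - 1]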
Creating a dataset
Today we only talk about data collection. Later on I will post a blog about hierarchical Bayesian models to determine the attributed value of TV commercials. One of the awesome packages in Python for data collection is Pandas. Yes, I am that guy who is addicted to Pandas. In the example below, we use Pandas to create a dataset of one year of data on minute level. The length of the dataset is 525,600 rows (365 days * 24 hours * 60 minutes). For this example we add one column with GRPs of commercials on the TV channel RTL 4. GRPs are the percentage of the target group that watched the commercial in that specific minute. For now we only consider one commercial for the entire year.
#######################
"""import libraries."""
#######################
import datetime
import numpy as np
import pandas as pd
#####################
"""create dataset."""
#####################
def create_dataset(start_datetime=datetime.datetime(year=2014, month=1, day=1),
                   end_datetime=datetime.datetime(year=2015, month=1, day=1)):
    """Create 1 year of data on minute level with 1 column with GRPs on RTL 4."""
    #create datetime values between start_datetime and end_datetime on minute level
    time_index = np.arange(start_datetime, end_datetime, dtype='datetime64[m]')
    #create dataframe including time_index
    data = pd.DataFrame(index=time_index)
    #define index name
    data.index.name = 'datetime'
    #add column with zeros for GRPs on RTL 4
    data['grp_on_rtl4'] = float(0)
    #add ten GRPs in the third minute of the first of January, the only commercial of the year
    data.loc[datetime.datetime(year=2014, month=1, day=1, hour=0, minute=2),
             'grp_on_rtl4'] = 10
    return data

#create dataset and print first and last rows of data
data = create_dataset()
print '\n First five rows of data: \n', data[:5]
print '\n Last five rows of data: \n', data[-5:]
First five rows of data:
grp_on_rtl4
datetime
2014-01-01 00:00:00 0
2014-01-01 00:01:00 0
2014-01-01 00:02:00 10
2014-01-01 00:03:00 0
2014-01-01 00:04:00 0
[5 rows x 1 columns]
Last five rows of data:
grp_on_rtl4
datetime
2014-12-31 23:55:00 0
2014-12-31 23:56:00 0
2014-12-31 23:57:00 0
2014-12-31 23:58:00 0
2014-12-31 23:59:00 0
[5 rows x 1 columns]
Calculate ad stocks: the slow way
To calculate ad stock values we make use of an engagement factor of 0.9. So one minute after a commercial 90% of the consumers are still engaged by the commercial, two minutes after the commercial 81% (0.9 * 0.9) are still engaged, and so forth. Below we see a simple for loop that writes the ad stock values directly into the dataset. The disadvantage is that every iteration has to perform an index operation. Index operations are great for quickly setting up a small program and for data manipulation, but again, I hate losing time on data collection! In this case the loop takes more than one minute, and that is just for one variable.
########################################################
"""test speed of non-vectorized ad stock calculations"""
########################################################
def calculate_ad_stocks_non_vectorized(data, engagement_factor=0.9):
    """Calculate ad stocks in a non-vectorized and slow way."""
    #initialize ad stock column
    data['grp_on_rtl4_ad_stock'] = float(0)
    #initialize ad_stock_value
    ad_stock_value = 0
    #loop through dataset to calculate ad stock values
    for index, row in data.iterrows():
        ad_stock_value = row['grp_on_rtl4'] + ad_stock_value * engagement_factor
        data.loc[index, 'grp_on_rtl4_ad_stock'] = ad_stock_value
    return data

#measure time spent in the non-vectorized and slow way
time_start = datetime.datetime.now()
data_ad_stocks = calculate_ad_stocks_non_vectorized(data=data,
                                                    engagement_factor=0.9)
time_end = datetime.datetime.now()
time_spend_non_vectorized = time_end - time_start
print '\n Time spent for ad stock transformation in non-vectorized and slow way: \n', time_spend_non_vectorized
print '\n First ten rows of data including ad stock values: \n', data_ad_stocks[:10]
Time spent for ad stock transformation in non-vectorized and slow way:
0:01:08.984000
First ten rows of data including ad stock values:
grp_on_rtl4 grp_on_rtl4_ad_stock
datetime
2014-01-01 00:00:00 0 0.000000
2014-01-01 00:01:00 0 0.000000
2014-01-01 00:02:00 10 10.000000
2014-01-01 00:03:00 0 9.000000
2014-01-01 00:04:00 0 8.100000
2014-01-01 00:05:00 0 7.290000
2014-01-01 00:06:00 0 6.561000
2014-01-01 00:07:00 0 5.904900
2014-01-01 00:08:00 0 5.314410
2014-01-01 00:09:00 0 4.782969
[10 rows x 2 columns]
Calculate ad stocks: the fast way
So if we hate losing time, why not make use of vector operations, which are way faster than index operations? The first change below is that we initialize a vector (a plain Python list) for the ad stock values instead of adding a new column to the dataset. In the for loop we append the ad stock values to this vector instead of writing them directly into the dataset. Finally, after the for loop, we add the vector to the dataset as a new column. This way we win almost a free minute, awesome!
####################################################
"""test speed of vectorized ad stock calculations"""
####################################################
def calculate_ad_stocks_vectorized(data, engagement_factor):
    """Calculate ad stocks in a vectorized and fast way."""
    #initialize ad stock vector
    grp_on_rtl4_ad_stock_vector = []
    #initialize ad_stock_value
    ad_stock_value = 0
    #loop through dataset to calculate ad stock values
    for index, row in data.iterrows():
        ad_stock_value = row['grp_on_rtl4'] + ad_stock_value * engagement_factor
        grp_on_rtl4_ad_stock_vector.append(ad_stock_value)
    #add ad stock vector to dataset
    data['grp_on_rtl4_ad_stock'] = grp_on_rtl4_ad_stock_vector
    return data

#measure time spent in the vectorized and fast way
time_start = datetime.datetime.now()
data_ad_stocks = calculate_ad_stocks_vectorized(data=data,
                                                engagement_factor=0.9)
time_end = datetime.datetime.now()
time_spend_vectorized = time_end - time_start
print '\n Time spent for ad stock transformation in vectorized and fast way: \n', time_spend_vectorized
print '\n First ten rows of data including ad stock values: \n', data_ad_stocks[:10]
#calculate speed up factor
speed_up_factor = time_spend_non_vectorized.total_seconds() / time_spend_vectorized.total_seconds()
print '\n Speed up factor: \n', speed_up_factor
Time spent for ad stock transformation in vectorized and fast way:
0:00:15.557000
First ten rows of data including ad stock values:
grp_on_rtl4 grp_on_rtl4_ad_stock
datetime
2014-01-01 00:00:00 0 0.000000
2014-01-01 00:01:00 0 0.000000
2014-01-01 00:02:00 10 10.000000
2014-01-01 00:03:00 0 9.000000
2014-01-01 00:04:00 0 8.100000
2014-01-01 00:05:00 0 7.290000
2014-01-01 00:06:00 0 6.561000
2014-01-01 00:07:00 0 5.904900
2014-01-01 00:08:00 0 5.314410
2014-01-01 00:09:00 0 4.782969
[10 rows x 2 columns]
Speed up factor:
4.43427396028
As we have seen above, you can speed up this data manipulation in Pandas by a factor of roughly 4.4, simply by replacing the per-row index assignments with a single vector assignment after the loop. In case of multiple variables I use a matrix manipulation instead of a vector manipulation per variable; this way you can win approximately ten minutes if you have ten variables.
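The matrix variant is not spelled out in this post, but a minimal sketch of what I mean could look like the function below. It reuses the numpy import from the top of the post, and the grp_columns argument is a hypothetical list holding the names of all GRP columns in the dataset.
#############################################################
"""sketch: ad stocks for multiple GRP columns in one pass."""
#############################################################
def calculate_ad_stocks_matrix(data, grp_columns, engagement_factor=0.9):
    """Calculate ad stocks for several GRP columns at once (illustrative sketch)."""
    #collect all GRP columns in one numpy matrix: rows are minutes, columns are channels
    grp_matrix = data[grp_columns].values.astype(float)
    #pre-allocate a matrix for the ad stock values
    ad_stock_matrix = np.zeros_like(grp_matrix)
    #previous ad stock values per column, starting at zero
    previous_values = np.zeros(grp_matrix.shape[1])
    #one pass over the rows updates every column at the same time
    for row_number in range(grp_matrix.shape[0]):
        previous_values = grp_matrix[row_number] + previous_values * engagement_factor
        ad_stock_matrix[row_number] = previous_values
    #write all ad stock columns back to the dataset in one go
    for column_number, column_name in enumerate(grp_columns):
        data[column_name + '_ad_stock'] = ad_stock_matrix[:, column_number]
    return data
For the single GRP column used in this post, calculate_ad_stocks_matrix(data, ['grp_on_rtl4']) gives the same ad stock values as the vectorized function above.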
Next time I will blog about what you can do with this data with the help of hierarchical Bayesian models, so we can determine the attributed value per TV commercial. Below you can already see a sneak preview of what we see in the actual data.
To be continued...