Getting started with the Google Analytics Reporting API in Python

In the Digital Analytics Power Hour podcast on R and Python, they mentioned that Python has quite a steep learning curve compared to R, for example when using the Google Analytics reporting API. During my day-to-day activities, I’ve been using the reporting API in Python, but I’ve always had help setting things up. This is why I decided to start over from scratch. In this tutorial you’ll:

Set up a connection to the Google Analytics Reporting API (v4) with Google Oauth;
Print your first API report.
Convert the API report to a pandas DataFrame.
Export data to Excel (for your non-python friends).
Bonus: implement a date-hack to reduce sampling.

If you’re new to Python, you should start with these three steps:

Learn Python, e.g. by following this Code Academy course.
Install Python and PyCharm. I suggest reading this post until phase 4.
Get yourself acquainted with virtual environments.

When you’ve read these posts, you’re good to go.

TLDR: If you don’t want to set up the project from scratch, visit my quick_gaapi GitHub repository.

Setting up your Python project

For my Python API project, I’ve created a virtual environment called gaapi and set it to Python 3.5.2. I recommend using a 3.x version as it handles special characters in strings better than 2.7 does. Besides creating a virtual environment, I always like to think of the project’s structure. So let’s create some files up front:

requirements.txt a list of the packages we’ll use.
credentials.py holds the credentials for your connection.
connect.py holds the code to connect to the Google Analytics API.
functions.py this will hold general functionality.
run.py is the file we’ll mostly be running (and where we’ll get our report data).

We’re now ready to set up our API connection.

1. Getting connected with OAuth

The first thing you’ll need to do is create a project on https://console.developers.google.com/. Erwin Maas has created a great post about this. To create a project, follow the 17 steps in his post.

Welcome back.

With your project ready in the Google Developer Console, we’re ready to set up the API connection. Start by adding credential variables from your project to you credentials.py file.

client_id = 'YOUR CLIENT ID'
  client_secret= 'YOUR CLIENT SECRET'
  redirect_uri = 'YOUR REDIRECT URI'
  access_code = ''
  access_token = ''
  refresh_token = ''

The first three values can be copied from your Google project. We’ll generate the last three during this tutorial in quite a hacky manner. Luckily, we’ll only have to do this once.

1.1 setting up your code

After putting your credentials in, add the following code to your connect.py file:

# import libraries
  from credentials import client_id, client_secret, redirect_uri, access_code, access_token, refresh_token
  from oauth2client.client import OAuth2WebServerFlow, GoogleCredentials
  import httplib2
  from googleapiclient.discovery import build

To use the new libraries, we’ll need to add the packages to our environment. You can do so by running the following command from your command line (don’t forget to change gaapi to the name of your virtual environment and install the all three packages):

C:\Anaconda3\envs\gaapi\Scripts\pip.exe install oauth2client

Besides that, add a line for each package to your requirements.txt file. This will make PyCharm prompt users to install it if the library isn’t available (especially handy when you share your project!).

After that, add:

# create connection based on project credentials
  flow = OAuth2WebServerFlow(client_id=client_id,
                             client_secret=client_secret,
                             scope='https://www.googleapis.com/auth/analytics',
                             redirect_uri=redirect_uri)

  # capture different states of connection
  if access_code == '':
      # first run prints oauth URL
      auth_uri = flow.step1_get_authorize_url()
      print (auth_uri)
  elif access_token == '' and refresh_token == '':
      # second run returns access and refresh token
      credentials = flow.step2_exchange(access_code)
      print(credentials.access_token)
      print(credentials.refresh_token)
  else:
      # third and future run connect through access token an refresh token
      credentials = GoogleCredentials(access_token, client_id, client_secret, refresh_token, 3920, 'https://accounts.google.com/o/oauth2/token', 'test')
      http = httplib2.Http()
      http = credentials.authorize(http)
      service = build('analytics', 'v4', http=http)
      print(service)

As you can see, I’ve set up an if statement to check the different states of your connection. It’ll smoothen the connection process.

1.2 Going once, going twice, and connected!

You’ll have to run the script three times:

The first run will print a URL. Open it, allow the app to use your Google Account, and copy the value behind the &code parameter. This is your access code. Set it as the access_code variable of your script. It allows you to connect to the Google API once.
In the second run we use the access_code to print the access token and refresh token. Set the printed values as the access_token and refresh_token variables. These two tokens allow you to keep on connecting to the Google API.

Where run 1 gave us a code to say "hey, let me in, look at my 1 entrance ticket", the second run gave us codes to say "hey, let me connect, here’s proof that I am allowed too keep on coming in".

The third and future runs use the access token and refresh token to make a connection with the Google API.

After your third run, you’ll see a printed line similar to this:

<googleapiclient.discovery.Resource object at 0x000001ABCD23EF00>

Good job! You’re now connected to the Google Analytics API. You can remove the print(service) line at the end of your code.

2. Your first report

Because we’re thinking of project structure op front, we can now leave credentials.py and connect.py as is and move on to the run.py file. The first thing you’ll need to do is import the connect.py file:

import connect

After that, we’ll add two functions from Google’s Hello Analytics Reporting API v4 quickstart guide: get_report (this gets us the data) and print_response (this prints the data in the console).

Don’t forget to add your view id!

def get_report(analytics):
    """Queries the Analytics Reporting API V4.

    Args:
      analytics: An authorized Analytics Reporting API V4 service object.
    Returns:
      The Analytics Reporting API V4 response.
    """
    return analytics.reports().batchGet(
        body={
          'reportRequests': [
          {
            'viewId': 'YOUR VIEW ID',
            'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
            'metrics': [{'expression': 'ga:sessions'}],
            'dimensions': [{'name': 'ga:source'}]
          }]
        }
    ).execute()

  def print_response(response):
    """Parses and prints the Analytics Reporting API V4 response.

    Args:
      response: An Analytics Reporting API V4 response.
    """
    for report in response.get('reports', []):
      columnHeader = report.get('columnHeader', {})
      dimensionHeaders = columnHeader.get('dimensions', [])
      metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])

      for row in report.get('data', {}).get('rows', []):
        dimensions = row.get('dimensions', [])
        dateRangeValues = row.get('metrics', [])

        for header, dimension in zip(dimensionHeaders, dimensions):
          print (header + ': ' + dimension)

        for i, values in enumerate(dateRangeValues):
          print ('Date range: ' + str(i))
          for metricHeader, value in zip(metricHeaders, values.get('values')):
            print (metricHeader.get('name') + ': ' + value)

Alright, one of the most important moments of this tutorial: your first printout of data. Add this line at the end of your run.py file:

print_response(get_report(connect.service))

This will print the response of the data we get with get_report, and connects through the service we’ve created in the connect file. Go ahead and run run.py and you should see your first data.

Before we continue, we’re going to clean up some code. Move the print_response function to the functions.py file. To use the function, import print_response from functions in the run.py file. Also, remove any comments from the function.

The run.py file now looks even tidier:

import connect
  from functions import print_response

  def get_report(analytics):
    return analytics.reports().batchGet(
        body={
          'reportRequests': [
          {
            'viewId': '100555616',
            'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
            'metrics': [{'expression': 'ga:sessions'}],
            'dimensions': [{'name': 'ga:source'}]
          }]
        }
    ).execute()

  print_response(get_report(connect.service))

We can also simplify the way we use the get_reports function. Start by moving it over to the functions file. Right now, we can only pass the connect.service, start_date and end_date variables. But within run.py you also want to pass your view ID, dimensions, metrics etc. To do so, first add the new values to the function and change the values in the functions to capture the newly added values:

def get_report(analytics, start_date, end_date, view_id, metrics, dimensions):
    return analytics.reports().batchGet(
        body={
          'reportRequests': [
          {
            'viewId': view_id,
            'dateRanges': [{'startDate':start_date, 'endDate': end_date}],
            'metrics': metrics,
            'dimensions': dimensions
          }]
        }
    ).execute()

You can use this step to add support for other values, e.g. segments.

Don’t forget to import the function in your run.py file:

from functions import print_response, get_report

  df = print_response(get_report(
    connect.service,
    start_date='2017-09-13',
    end_date='2017-09-21',
    view_id='100555616',
    metrics=[{'expression': 'ga:sessions'},],
    dimensions=[{'name': 'ga:source'}],
  )

  print(df)

As a final step, we’re going to simplify the function call in the run.py file to a single function:

print_response(get_report(...

Import connect in your functions.py file and remove the import from run.py. After that, add a new function:

def return_ga_data(start_date, end_date, view_id, metrics, dimensions):
    return print_response(get_report(connect.service, start_date, end_date, view_id, metrics, dimensions))

Now within run.py, all we have to do is import and call return_ga_data:

from functions import return_ga_data

  df = return_ga_data(
    start_date='2017-09-13',
    end_date='2017-09-21',
    view_id='100555616',
    metrics=[{'expression': 'ga:sessions'},],
    dimensions=[{'name': 'ga:source'}],
  )

  print(df)

Your code is now nice and tidy.

3. Return the Google Analytics response as a pandas DataFrame.

Alright, we’ve connected to the Google Analytics API and printed some data. But we’re not there yet. If we want to do something with the data, we need to use pandas. Here’s why:

pandas is an open source BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Before we continue with the serious stuff, how awesome is it to tell your friends you’re working with Python and pandas?

Head back to the functions.py file and import the Pandas package (don’t forget to install the package to your environment and update your requirements.txt file!):

import pandas as pd

With the Pandas package installed, we’re ready to change the print_response function to return a DataFrame. It turns out converting the JSON data response from the Analytics API to a DataFrame is not super-easy. To make it easier, we’ll convert the JSON response into a list with a dict per row (a dict is a data object with keys and values). The list we’re creating looks like this:

[
    {'ga:source': '(direct)', 'ga:sessions': 3},
    {'ga:source': 'google', 'ga:sessions': 4},
    ...
  ]

We can easily convert this list to a DataFrame, where the keys are the column names and the values fill out each row.

Here’s the new print_response function to turn the GA data into a DataFrame:

def print_response(response):
    list = []
    # get report data
    for report in response.get('reports', []):
      # set column headers
      columnHeader = report.get('columnHeader', {})
      dimensionHeaders = columnHeader.get('dimensions', [])
      metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])
      rows = report.get('data', {}).get('rows', [])

      for row in rows:
          # create dict for each row
          dict = {}
          dimensions = row.get('dimensions', [])
          dateRangeValues = row.get('metrics', [])

          # fill dict with dimension header (key) and dimension value (value)
          for header, dimension in zip(dimensionHeaders, dimensions):
            dict[header] = dimension

          # fill dict with metric header (key) and metric value (value)
          for i, values in enumerate(dateRangeValues):
            for metric, value in zip(metricHeaders, values.get('values')):
              #set int as int, float a float
              if ',' in value or '.' in value:
                dict[metric.get('name')] = float(value)
              else:
                dict[metric.get('name')] = int(value)

          list.append(dict)

      df = pd.DataFrame(list)
      return df

Now as the last step, the function name print_response isn’t right for what it does anymore. Go ahead and change the function name to convert_reponse_to_df and refer to the new function name within the return_ga_data function.

4. Exporting data

If you’ve made it this far, you are awesome. You have imported GA data with Python as a DataFrame. Sadly, not all of your friends, colleagues and clients are just as cool as you are. Luckily for them, you can export data to several standard files like a good old Excel sheet.

First, we create a new function in functions.py. Install the xlsxwriter package (we don’t have to import it) and after that, add this function:

def save_df_to_excel(df, path, file_name, sheet_name):
    writer = pd.ExcelWriter(path+file_name+'.xlsx', engine='xlsxwriter')
    df.to_excel(writer, sheet_name=sheet_name)
    writer.save()

Now when you import the function, you can export a DataFrame to a good-old Excel file:

save_df_to_excel(
    df=df,
    path='C:\\Users\\Erik\\Documents\\',
    file_name='test_export',
    sheet_name='data'
  )

You need to escape slashes in the file path.

5. Optional: the date hack

In 2016 we’ve discussed a neat little trick to greatly reduce the amount of sampled data in your data exports. This is especially valuable for high traffic non-360 Google Analytics properties that often return sampled data. As an optional bonus, here’s a guide to implementing it into the project we’ve been working on.

First up: we need to import date-handling functions. At the top of your ‘function.py’ code, add:

from datetime import datetime, timedelta
  from time import sleep

After that, we’ll change the return_ga_data function:

add a new value called split_dates: this allows us to split the dates optionally.
add a new value called group_by: to group data that are imported on a daily basis (e.g. by source or week).
add the functionality to convert a date range into one request per date in the range.

Here’s what our new function looks like:

def return_ga_data(start_date, end_date, view_id, metrics, dimensions, split_dates, group_by):
      if split_dates == False:
          return convert_reponse_to_df(get_report(connect.service, start_date, end_date, view_id, metrics, dimensions))
      else:
          start_date = datetime.strptime(start_date, '%Y-%m-%d').date()
          end_date = datetime.strptime(end_date, '%Y-%m-%d').date()
          delta = end_date - start_date         # timedelta
          dates = []

          for i in range(delta.days + 1):
              dates.append(start_date + timedelta(days=i))

          df_total = pd.DataFrame()
          for date in dates:
              date = str(date)
              df_total = df_total.append(convert_reponse_to_df(get_report(connect.service, date, date, view_id, metrics, dimensions)))
              sleep(1)

          if len(group_by) != 0:
              df_total = df_total.groupby(group_by).sum()

          return df_total

Here’s what the split date hack does:

Convert the date strings into date objects.
Get the time difference.
Fill a list with all the dates from start_date to end_date.
Add the DataFrame from each date to df_total.

the 1-second sleep makes sure you don’t cross the API quota (100 requests per 100 seconds per user).

Optionally group by a set range of dimensions.
Return df_total.

Et voila, you just gave yourself a Google Analytics data import function with optional date split and group by functionality.

That’s all folks

This tutorial is designed to help you get started with the Google Analytics Reporting API (v4) in Python and give you a Pandas DataFrame to work with. Keep in mind that it’s just a foundation. As a next step, you can follow this pandas course by Thom Hopmans to develop your Pandas skills.

For the latest version of this project, visit the quick_gaapi GitHub repo.