In the Digital Analytics Power Hour podcast on R and Python, they mentioned that Python has quite a steep learning curve compared to R, for example when using the Google Analytics reporting API. During my day-to-day activities, I’ve been using the reporting API in Python, but I’ve always had help setting things up. This is why I decided to start over from scratch. In this tutorial you’ll:
- Set up a connection to the Google Analytics Reporting API (v4) with Google Oauth;
- Print your first API report.
- Convert the API report to a pandas DataFrame.
- Export data to Excel (for your non-python friends).
- Bonus: implement a date-hack to reduce sampling.
If you’re new to Python, you should start with these three steps:
- Learn Python, e.g. by following this Code Academy course.
- Install Python and PyCharm. I suggest reading this post until phase 4.
- Get yourself acquainted with virtual environments.
When you’ve read these posts, you’re good to go.
TLDR: If you don’t want to set up the project from scratch, visit my quick_gaapi GitHub repository.
Setting up your Python project
For my Python API project, I’ve created a virtual environment called gaapi
and set it to Python 3.5.2
. I recommend using a 3.x
version as it handles special characters in strings better than 2.7
does. Besides creating a virtual environment, I always like to think of the project’s structure. So let’s create some files up front:
requirements.txt
a list of the packages we’ll use.credentials.py
holds the credentials for your connection.connect.py
holds the code to connect to the Google Analytics API.functions.py
this will hold general functionality.run.py
is the file we’ll mostly be running (and where we’ll get our report data).
We’re now ready to set up our API connection.
1. Getting connected with OAuth
The first thing you’ll need to do is create a project on https://console.developers.google.com/. Erwin Maas has created a great post about this. To create a project, follow the 17 steps in his post.
Welcome back.
With your project ready in the Google Developer Console, we’re ready to set up the API connection. Start by adding credential variables from your project to you credentials.py
file.
client_id = 'YOUR CLIENT ID'
client_secret= 'YOUR CLIENT SECRET'
redirect_uri = 'YOUR REDIRECT URI'
access_code = ''
access_token = ''
refresh_token = ''
The first three values can be copied from your Google project. We’ll generate the last three during this tutorial in quite a hacky manner. Luckily, we’ll only have to do this once.
1.1 setting up your code
After putting your credentials in, add the following code to your connect.py
file:
# import libraries
from credentials import client_id, client_secret, redirect_uri, access_code, access_token, refresh_token
from oauth2client.client import OAuth2WebServerFlow, GoogleCredentials
import httplib2
from googleapiclient.discovery import build
To use the new libraries, we’ll need to add the packages to our environment. You can do so by running the following command from your command line (don’t forget to change gaapi
to the name of your virtual environment and install the all three packages):
C:\Anaconda3\envs\gaapi\Scripts\pip.exe install oauth2client
Besides that, add a line for each package to your requirements.txt
file. This will make PyCharm prompt users to install it if the library isn’t available (especially handy when you share your project!).
After that, add:
# create connection based on project credentials
flow = OAuth2WebServerFlow(client_id=client_id,
client_secret=client_secret,
scope='https://www.googleapis.com/auth/analytics',
redirect_uri=redirect_uri)
# capture different states of connection
if access_code == '':
# first run prints oauth URL
auth_uri = flow.step1_get_authorize_url()
print (auth_uri)
elif access_token == '' and refresh_token == '':
# second run returns access and refresh token
credentials = flow.step2_exchange(access_code)
print(credentials.access_token)
print(credentials.refresh_token)
else:
# third and future run connect through access token an refresh token
credentials = GoogleCredentials(access_token, client_id, client_secret, refresh_token, 3920, 'https://accounts.google.com/o/oauth2/token', 'test')
http = httplib2.Http()
http = credentials.authorize(http)
service = build('analytics', 'v4', http=http)
print(service)
As you can see, I’ve set up an if statement to check the different states of your connection. It’ll smoothen the connection process.
1.2 Going once, going twice, and connected!
You’ll have to run the script three times:
- The first run will print a URL. Open it, allow the app to use your Google Account, and copy the value behind the
&code
parameter. This is your access code. Set it as theaccess_code
variable of your script. It allows you to connect to the Google API once. - In the second run we use the
access_code
to print the access token and refresh token. Set the printed values as theaccess_token
andrefresh_token
variables. These two tokens allow you to keep on connecting to the Google API.
- Where run 1 gave us a code to say "hey, let me in, look at my 1 entrance ticket", the second run gave us codes to say "hey, let me connect, here’s proof that I am allowed too keep on coming in".
- The third and future runs use the access token and refresh token to make a connection with the Google API.
After your third run, you’ll see a printed line similar to this:
<googleapiclient.discovery.Resource object at 0x000001ABCD23EF00>
Good job! You’re now connected to the Google Analytics API. You can remove the print(service)
line at the end of your code.
2. Your first report
Because we’re thinking of project structure op front, we can now leave credentials.py
and connect.py
as is and move on to the run.py
file. The first thing you’ll need to do is import the connect.py
file:
import connect
After that, we’ll add two functions from Google’s Hello Analytics Reporting API v4 quickstart guide: get_report
(this gets us the data) and print_response
(this prints the data in the console).
Don’t forget to add your view id!
def get_report(analytics):
"""Queries the Analytics Reporting API V4.
Args:
analytics: An authorized Analytics Reporting API V4 service object.
Returns:
The Analytics Reporting API V4 response.
"""
return analytics.reports().batchGet(
body={
'reportRequests': [
{
'viewId': 'YOUR VIEW ID',
'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
'metrics': [{'expression': 'ga:sessions'}],
'dimensions': [{'name': 'ga:source'}]
}]
}
).execute()
def print_response(response):
"""Parses and prints the Analytics Reporting API V4 response.
Args:
response: An Analytics Reporting API V4 response.
"""
for report in response.get('reports', []):
columnHeader = report.get('columnHeader', {})
dimensionHeaders = columnHeader.get('dimensions', [])
metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])
for row in report.get('data', {}).get('rows', []):
dimensions = row.get('dimensions', [])
dateRangeValues = row.get('metrics', [])
for header, dimension in zip(dimensionHeaders, dimensions):
print (header + ': ' + dimension)
for i, values in enumerate(dateRangeValues):
print ('Date range: ' + str(i))
for metricHeader, value in zip(metricHeaders, values.get('values')):
print (metricHeader.get('name') + ': ' + value)
Alright, one of the most important moments of this tutorial: your first printout of data. Add this line at the end of your run.py
file:
print_response(get_report(connect.service))
This will print the response of the data we get with get_report
, and connects through the service we’ve created in the connect file. Go ahead and run run.py
and you should see your first data.
Before we continue, we’re going to clean up some code. Move the print_response
function to the functions.py
file. To use the function, import print_response
from functions in the run.py
file. Also, remove any comments from the function.
The run.py
file now looks even tidier:
import connect
from functions import print_response
def get_report(analytics):
return analytics.reports().batchGet(
body={
'reportRequests': [
{
'viewId': '100555616',
'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
'metrics': [{'expression': 'ga:sessions'}],
'dimensions': [{'name': 'ga:source'}]
}]
}
).execute()
print_response(get_report(connect.service))
We can also simplify the way we use the get_reports
function. Start by moving it over to the functions file. Right now, we can only pass the connect.service
, start_date
and end_date
variables. But within run.py
you also want to pass your view ID, dimensions, metrics etc. To do so, first add the new values to the function and change the values in the functions to capture the newly added values:
def get_report(analytics, start_date, end_date, view_id, metrics, dimensions):
return analytics.reports().batchGet(
body={
'reportRequests': [
{
'viewId': view_id,
'dateRanges': [{'startDate':start_date, 'endDate': end_date}],
'metrics': metrics,
'dimensions': dimensions
}]
}
).execute()
You can use this step to add support for other values, e.g. segments
.
Don’t forget to import the function in your run.py
file:
from functions import print_response, get_report
df = print_response(get_report(
connect.service,
start_date='2017-09-13',
end_date='2017-09-21',
view_id='100555616',
metrics=[{'expression': 'ga:sessions'},],
dimensions=[{'name': 'ga:source'}],
)
print(df)
As a final step, we’re going to simplify the function call in the run.py
file to a single function:
print_response(get_report(...
Import connect
in your functions.py
file and remove the import from run.py
. After that, add a new function:
def return_ga_data(start_date, end_date, view_id, metrics, dimensions):
return print_response(get_report(connect.service, start_date, end_date, view_id, metrics, dimensions))
Now within run.py
, all we have to do is import and call return_ga_data
:
from functions import return_ga_data
df = return_ga_data(
start_date='2017-09-13',
end_date='2017-09-21',
view_id='100555616',
metrics=[{'expression': 'ga:sessions'},],
dimensions=[{'name': 'ga:source'}],
)
print(df)
Your code is now nice and tidy.
3. Return the Google Analytics response as a pandas DataFrame.
Alright, we’ve connected to the Google Analytics API and printed some data. But we’re not there yet. If we want to do something with the data, we need to use pandas. Here’s why:
pandas is an open source BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Before we continue with the serious stuff, how awesome is it to tell your friends you’re working with Python and pandas?
Head back to the functions.py
file and import the Pandas package (don’t forget to install the package to your environment and update your requirements.txt
file!):
import pandas as pd
With the Pandas package installed, we’re ready to change the print_response
function to return a DataFrame. It turns out converting the JSON data response from the Analytics API to a DataFrame is not super-easy. To make it easier, we’ll convert the JSON response into a list
with a dict
per row (a dict
is a data object with keys and values). The list
we’re creating looks like this:
[
{'ga:source': '(direct)', 'ga:sessions': 3},
{'ga:source': 'google', 'ga:sessions': 4},
...
]
We can easily convert this list to a DataFrame, where the keys are the column names and the values fill out each row.
Here’s the new print_response
function to turn the GA data into a DataFrame:
def print_response(response):
list = []
# get report data
for report in response.get('reports', []):
# set column headers
columnHeader = report.get('columnHeader', {})
dimensionHeaders = columnHeader.get('dimensions', [])
metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])
rows = report.get('data', {}).get('rows', [])
for row in rows:
# create dict for each row
dict = {}
dimensions = row.get('dimensions', [])
dateRangeValues = row.get('metrics', [])
# fill dict with dimension header (key) and dimension value (value)
for header, dimension in zip(dimensionHeaders, dimensions):
dict[header] = dimension
# fill dict with metric header (key) and metric value (value)
for i, values in enumerate(dateRangeValues):
for metric, value in zip(metricHeaders, values.get('values')):
#set int as int, float a float
if ',' in value or '.' in value:
dict[metric.get('name')] = float(value)
else:
dict[metric.get('name')] = int(value)
list.append(dict)
df = pd.DataFrame(list)
return df
Now as the last step, the function name print_response
isn’t right for what it does anymore. Go ahead and change the function name to convert_reponse_to_df
and refer to the new function name within the return_ga_data
function.
4. Exporting data
If you’ve made it this far, you are awesome. You have imported GA data with Python as a DataFrame. Sadly, not all of your friends, colleagues and clients are just as cool as you are. Luckily for them, you can export data to several standard files like a good old Excel sheet.
First, we create a new function in functions.py
. Install the xlsxwriter
package (we don’t have to import it) and after that, add this function:
def save_df_to_excel(df, path, file_name, sheet_name):
writer = pd.ExcelWriter(path+file_name+'.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name=sheet_name)
writer.save()
Now when you import the function, you can export a DataFrame to a good-old Excel file:
save_df_to_excel(
df=df,
path='C:\\Users\\Erik\\Documents\\',
file_name='test_export',
sheet_name='data'
)
You need to escape slashes in the file path.
5. Optional: the date hack
In 2016 we’ve discussed a neat little trick to greatly reduce the amount of sampled data in your data exports. This is especially valuable for high traffic non-360 Google Analytics properties that often return sampled data. As an optional bonus, here’s a guide to implementing it into the project we’ve been working on.
First up: we need to import date-handling functions. At the top of your ‘function.py’ code, add:
from datetime import datetime, timedelta
from time import sleep
After that, we’ll change the return_ga_data
function:
- add a new value called
split_dates
: this allows us to split the dates optionally. - add a new value called
group_by
: to group data that are imported on a daily basis (e.g. by source or week). - add the functionality to convert a date range into one request per date in the range.
Here’s what our new function looks like:
def return_ga_data(start_date, end_date, view_id, metrics, dimensions, split_dates, group_by):
if split_dates == False:
return convert_reponse_to_df(get_report(connect.service, start_date, end_date, view_id, metrics, dimensions))
else:
start_date = datetime.strptime(start_date, '%Y-%m-%d').date()
end_date = datetime.strptime(end_date, '%Y-%m-%d').date()
delta = end_date - start_date # timedelta
dates = []
for i in range(delta.days + 1):
dates.append(start_date + timedelta(days=i))
df_total = pd.DataFrame()
for date in dates:
date = str(date)
df_total = df_total.append(convert_reponse_to_df(get_report(connect.service, date, date, view_id, metrics, dimensions)))
sleep(1)
if len(group_by) != 0:
df_total = df_total.groupby(group_by).sum()
return df_total
Here’s what the split date hack does:
- Convert the date strings into date objects.
- Get the time difference.
- Fill a list with all the dates from
start_date
toend_date
. - Add the DataFrame from each date to
df_total
.
- the 1-second sleep makes sure you don’t cross the API quota (100 requests per 100 seconds per user).
- Optionally group by a set range of dimensions.
- Return
df_total
.
Et voila, you just gave yourself a Google Analytics data import function with optional date split and group by functionality.
That’s all folks
This tutorial is designed to help you get started with the Google Analytics Reporting API (v4) in Python and give you a Pandas DataFrame to work with. Keep in mind that it’s just a foundation. As a next step, you can follow this pandas course by Thom Hopmans to develop your Pandas skills.
For the latest version of this project, visit the quick_gaapi GitHub repo.
Leave a Reply
You must be logged in to post a comment.