Cohort Retention Analysis in BigQuery

Cohort retention analysis is one of those things that sounds fancy but boils down to a simple question: are your users coming back? This post walks through building it from scratch in BigQuery — no external datasets, just a small synthetic table we’ll create together.

What is a Cohort?

A cohort is a group of users who share a common starting point — usually their first interaction with your product. The most common definition is: all users who first showed up in the same week (or month).

So users who first visited your app during the week of 2024-01-01 form one cohort, and users who first visited during 2024-01-08 form another.

What is Retention?

Retention measures how many users from a cohort came back in a later period.

  • 100 users join in Week 0
  • 40 of them return in Week 1 → 40% Week 1 retention
  • 20 of them return in Week 2 → 20% Week 2 retention

Week 0 retention is always 100% — it’s just the cohort size itself.

Cohort retention workflow and heatmap

Setting Up Sample Data

We’ll create a simple user_events table with a user_id and an event_date. This mimics any real events table — pageviews, app opens, purchases, whatever your product tracks.

CREATE OR REPLACE TABLE `your_project.analytics_demo.user_events` AS
SELECT * FROM UNNEST([
  STRUCT("user_1" AS user_id, DATE "2024-01-01" AS event_date),
  ("user_1", DATE "2024-01-08"),
  ("user_1", DATE "2024-01-15"),

  ("user_2", DATE "2024-01-01"),
  ("user_2", DATE "2024-01-08"),

  ("user_3", DATE "2024-01-08"),
  ("user_3", DATE "2024-01-15"),

  ("user_4", DATE "2024-01-15"),
  ("user_4", DATE "2024-01-22"),

  ("user_5", DATE "2024-01-15")
]);

A quick sanity check before we start:

-- total distinct users
SELECT COUNT(DISTINCT user_id) AS total_users
FROM `your_project.analytics_demo.user_events`;

-- active users per week
SELECT
  DATE_TRUNC(event_date, WEEK) AS week,
  COUNT(DISTINCT user_id) AS active_users
FROM `your_project.analytics_demo.user_events`
GROUP BY 1
ORDER BY 1;

Building Cohort Retention

The analysis has a few clear pieces. We’ll build each one, then combine them.

Step 1: Find each user’s first activity date

WITH first_activity AS (
  SELECT
    user_id,
    MIN(event_date) AS first_date
  FROM `your_project.analytics_demo.user_events`
  GROUP BY user_id
)
SELECT * FROM first_activity;

Step 2: Assign users to a cohort week

We truncate first_date to the week boundary so users who joined on different days within the same week are grouped together.

WITH first_activity AS (
  SELECT
    user_id,
    MIN(event_date) AS first_date
  FROM `your_project.analytics_demo.user_events`
  GROUP BY user_id
)
SELECT
  user_id,
  DATE_TRUNC(first_date, WEEK) AS cohort_week
FROM first_activity;

Step 3: Get all weekly activity

SELECT
  user_id,
  DATE_TRUNC(event_date, WEEK) AS activity_week
FROM `your_project.analytics_demo.user_events`;

Step 4: Join cohort to activity

This gives us one row per (user, activity week), carrying the user’s cohort week along.

WITH first_activity AS (
  SELECT user_id, MIN(event_date) AS first_date
  FROM `your_project.analytics_demo.user_events`
  GROUP BY user_id
),
cohort AS (
  SELECT
    user_id,
    DATE_TRUNC(first_date, WEEK) AS cohort_week
  FROM first_activity
),
activity AS (
  SELECT
    user_id,
    DATE_TRUNC(event_date, WEEK) AS activity_week
  FROM `your_project.analytics_demo.user_events`
)
SELECT
  c.cohort_week,
  a.activity_week,
  c.user_id
FROM cohort c
JOIN activity a
  ON c.user_id = a.user_id
  AND a.activity_week >= c.cohort_week;

The activity_week >= cohort_week filter makes sure we only look at activity after (or at) the user’s cohort week.

Step 5: Calculate weeks since first activity

DATE_DIFF gives us the distance between the activity week and the cohort week — this becomes our retention period index.

WITH first_activity AS (
  SELECT user_id, MIN(event_date) AS first_date
  FROM `your_project.analytics_demo.user_events`
  GROUP BY user_id
),
cohort AS (
  SELECT
    user_id,
    DATE_TRUNC(first_date, WEEK) AS cohort_week
  FROM first_activity
),
activity AS (
  SELECT
    user_id,
    DATE_TRUNC(event_date, WEEK) AS activity_week
  FROM `your_project.analytics_demo.user_events`
),
retention AS (
  SELECT
    c.cohort_week,
    DATE_DIFF(a.activity_week, c.cohort_week, WEEK) AS weeks_since_first_activity,
    c.user_id
  FROM cohort c
  JOIN activity a
    ON c.user_id = a.user_id
    AND a.activity_week >= c.cohort_week
)
SELECT
  cohort_week,
  weeks_since_first_activity,
  COUNT(DISTINCT user_id) AS retained_users
FROM retention
GROUP BY 1, 2
ORDER BY 1, 2;

Step 6: Add cohort size and retention rate

The last piece — divide retained users by cohort size to get the percentage.

WITH first_activity AS (
  SELECT user_id, MIN(event_date) AS first_date
  FROM `your_project.analytics_demo.user_events`
  GROUP BY user_id
),
cohort AS (
  SELECT
    user_id,
    DATE_TRUNC(first_date, WEEK) AS cohort_week
  FROM first_activity
),
activity AS (
  SELECT
    user_id,
    DATE_TRUNC(event_date, WEEK) AS activity_week
  FROM `your_project.analytics_demo.user_events`
),
retention AS (
  SELECT
    c.cohort_week,
    DATE_DIFF(a.activity_week, c.cohort_week, WEEK) AS weeks_since_first_activity,
    c.user_id
  FROM cohort c
  JOIN activity a
    ON c.user_id = a.user_id
    AND a.activity_week >= c.cohort_week
),
retention_count AS (
  SELECT
    cohort_week,
    weeks_since_first_activity,
    COUNT(DISTINCT user_id) AS retained_users
  FROM retention
  GROUP BY 1, 2
),
cohort_size AS (
  SELECT
    cohort_week,
    COUNT(DISTINCT user_id) AS cohort_users
  FROM cohort
  GROUP BY 1
)
SELECT
  r.cohort_week,
  r.weeks_since_first_activity,
  r.retained_users,
  c.cohort_users,
  ROUND(r.retained_users / c.cohort_users * 100, 2) AS retention_rate
FROM retention_count r
JOIN cohort_size c ON r.cohort_week = c.cohort_week
ORDER BY 1, 2;

Reading the Output

With our sample data, you’ll see something like:

cohort_weekweekretained_userscohort_usersretention_rate
2024-01-01022100%
2024-01-01122100%
2024-01-0121250%
2024-01-08022100%
2024-01-08122100%
2024-01-15022100%
2024-01-1511250%
  • Week 0 is always 100% — it’s the cohort’s starting point
  • Each subsequent week shows how many held on
  • Newer cohorts have fewer columns — there’s simply no future data yet, and that’s expected

Cohort retention results table and heatmap

Visualizing with a Heatmap

Export the result as CSV and pivot it in Python:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("retention.csv")

heatmap = df.pivot_table(
    values="retention_rate",
    index="cohort_week",
    columns="weeks_since_first_activity"
)

plt.figure(figsize=(12, 6))
sns.heatmap(heatmap, annot=True, fmt=".1f", cmap="YlGnBu")
plt.title("Weekly Cohort Retention (%)")
plt.tight_layout()
plt.show()

The heatmap makes drop-off patterns immediately obvious — you can spot at a glance whether Week 2 retention improved for newer cohorts compared to older ones.

Pitfalls

Grouping by event week instead of first activity week. A user who joined in January and is active in March belongs to the January cohort, not the March one. Always anchor on MIN(event_date).

Counting events instead of users. If a user triggers 10 events in a week, they should still count as 1 retained user. Use COUNT(DISTINCT user_id).

Treating empty cells as zero. The bottom-right triangle of a cohort table is empty because those weeks haven’t happened yet — not because users churned. Don’t fill them with zeros.

Mixing time granularities. If you truncate cohort assignment to weeks but activity to days (or vice versa), the join breaks. Keep both on the same grain throughout.