Pedram's Data Based

Share this post

Counting Things: Counting Users Part 2

pedram.substack.com
Deep Dives

Counting Things: Counting Users Part 2

come on get a little bit closer baby, cause tonight is the night

Pedram Navid
Aug 9, 2022
7
1
Share this post

Counting Things: Counting Users Part 2

pedram.substack.com

In my last post, I walked through an example of counting people who visit your site, and the complexities that come with it. Next, we’ll explore what happens when a visitor becomes a user and emits a couple innocuous events.

When Two Become One

Let’s say Rachel visited your site over the past few months. For simplicity, she was kind enough to persist cookies, use the same device across both sites, and generally be friendly toward your site tracking. We are using something like Rudderstack, Segment, Amplitude, Mixpanel, Jitsu, or Snowplow for event tracking.

Rachel clicks the giant, blinking, iridescent ✨sign-up✨ button your growth team so thoughtfully placed in the middle of your website. She signs up with her email address and creates a password. Somewhere, a growth marketer wakes from her dreams. Success.

If your engineering team was kind and generous, they also instrumented the sign-up event and the subsequent sign-in, and now you have three types of events, and they might look like this.

Our tracked events table, (let’s call it tracks), looks like this:

anonymous_id | event_name | event_time | user_id  | source
123          | viewed_page| 2022-06-01 | null     | web
123          | signed_up  | 2022-08-01 | aaaa-111 | nodejs
123          | signed_in  | 2022-08-01 | aaaa-111 | nodejs

A few things to note:

  1. Your old rows don’t get updated when new information arrives. In June, we didn’t know the user id of our visitor, since they had not signed up. But in August we did have that information.

  2. Events take place in different contexts. The first event was emitted from your marketing website on the front-end. The second and third were server-side events from the backend, directly to your event-tracker. The first event may not always fire, depending on ad-blocking, network blips or browser behaviour.

These nuances will make the lives of your data practitioners hard, so it’s important to have lots of sympathy and moral support for them when they inevitably start working on sessionization. Help is available, and they are not alone.

What Can We Do With Events?

Given just the three events, we can ask many different types of questions:

Attribution

What are the leading sources of user sign ups?

For people who signed up, what was the first page they visited on our marketing site? Or the last?

Adoption

How many people signed-up for my product each day?

How long does it take for an average visitor to sign-up?

What percent of visitors end up signing up for our product, and how does that change over time?

Engagement

How many people who sign-in to our product every day are new users?

How many of them are existing?

How many users stopped signing in?

How many users came back after a break?


Stitching User Events

Before we can start chipping away at our newly formed backlog of questions we still have to solve the fundamental problem of user stitching. We want to associate every event we have with the user id, even if the user was not known until later.

Given the simplified example above, we can create a mapping of anonymous_id → user_id by using a window function.

For simplicity, I’m using Snowflake syntax, with other implementations you may need to specify: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to look past the current row for the last value.

If you’re new to window functions, this can look daunting. Think of a window as a slice of a table. We want to operate on every row that has the same anonymous_id. In each slice, apply a function to get a result, and add it as a new column. In this case, we’re applying the last function, which finds the last row in that window.

Here’s a little illustration of how a window function might work. Start by taking the partition highlighted in yellow, then within each partition order by timestamp, and then take the last value in that partition, and use that as the result that fills each row.

Once the events have been stitched, we have our desired output: a mapping of anonymous id to user id.

Conversely, we also know that any anonymous id that isn’t mapped to a user id has not signed up, or cannot otherwise be identified.

anonymous_id | user_id 
123          | aaaa-111
124          | aaaa-111
125          | aaaa-111
234          | bbbb-222 
789          | NULL      <- this person has never signed up

With the above, we can now start to chip away at our questions from before.

Liking what you read? Data Based is only possible because of the support of subscribers. To receive new posts and support my work, consider becoming a free or paid subscriber.

Attribution

What are the leading sources of user sign ups?

For people who signed up, what was the first page they visited on our marketing site? Or the last?

What we know:

  • all the visitors to our websites and how they got there (from Counting Users, Part 1)

  • which visitors eventually became users

So naturally, we can look at where visitors came from before signing up to understand where our sign-ups come from.

There are many models for attributing sign-ups to visits (all of them bad, some of them useful). The simplest ones look at either the first, or last thing someone did before they performed a conversion. Consider the following events:

anonymous_id| path   | utm        | referrer | event_time
123         | /      | cpc-google | google   | 2 days ago
123         | /blog/ | NULL       | bing     | yesterday

We start by joining anonymous_id above to a stitching table we created in the previous step, and let’s say we find that the anonymous user 123 is actually user aaaa-1111.

Pretend we also have a table that tells us when the user signed-up, and it was today. We can either give credit to a cost-per-click advertisement (first-touch attribution) or to our blog (last-touch attribution).

We can get more complex if we wish. Maybe we only want to look back a certain number of days. For example, does it make sense to give credit to a paid ad from 18 months ago if someone signs up today?

We might want to categorize different types of web traffic according to some rules by bucketing similar traffic together, such as social media sources. (There’s a great dbt seed file in the segment package that helps with this)

We may want to go beyond attributing conversion events, to better understand what brings new visitors to our site for the first time. For every user, we could look at the first page they visited and categorize that traffic to understand ‘landing pages’.

Hard to believe, but the answers to every single one of these questions starts with just a few events and stitching.

Adoption

How many people signed-up for my product each day?

How long does it take for an average visitor to sign-up?

What percent of visitors end up signing up for our product, and how does that change over time?

Understanding adoption is also made possible by the same types of events we used for attribution. If we look only at the sign-in events, we can count how many people visit our site like so:

select 
date_trunc('day, timestamp) as event_day,
count(distinct user_id) 

from tracks
group by event_day

This query counts the number of distinct users within a specified time period. We use count distinct because a single user often has multiple events a day.

If we want to know how long it takes for a visitor to sign-up, we can look at the time elapsed between their first visit, and their sign-up event.

with conversions as (
  /* Assume one signed-up event per user for simplicity */
    select
    user_id,
    timestamp as signup_date

    from tracks 
    where event_name = 'signed-up'

)

select distinct

user_id,
first_value(timestamp) over(partition by stitched.user_id) as first_event_date,
signup_date,
datediff('days', first_event_date, signup_date) as days_to_signup

from tracks 
join stitched using(anonymous_id)

We use a window function again, this time to get the first event. We count the days between the first event and the conversion to see how long it takes for someone to sign up.

We can also perform a very rudimentary funnel analysis by counting the number of new visitors and sign-ups each day.

To help us, let’s imagine a helper column called blended_user_id. It is the user id if it’s known, or the anonymous id if not.

We find the first event ever for a particular blended user id, and then find the first sign-up event for each user. Count the number of times each of those events happen, every day, and get a funnel count of visitors → users.

with visitors as (
    select 

    date_trunc('days', timestamp) as day,
    count(distinct blended_user_id) as new_visitors

    from stitched_tracks
    group by day
    qualify row_number() over(partition by anonymous_id order by timestamp) = 1
),

signups as (

    select 

    date_trunc('days', timestamp) as day,
    count(distinct blended_user_id) as new_signups

    from stitched_tracks
    where event_name = 'signed-up'
    group by day
    qualify row_number() over(partition by anonymous_id order by timestamp) = 1
),

select 

day,
new_signups,
new_visitors

from visitors
full join signups using (day)

Engagement

How many people who sign-in to our product every day are new users?

How many of them are existing?

How many users stopped signing in?

How many users came back after a break?

We can even start to get into some fun churn and retention analysis. One really simple (and not useful) way to measure churn might be to count:

  • Anyone who signed in today that signed in yesterday (retention)

  • Anyone who signed in yesterday that didn’t sign in today (churn)

We’re using some really fun SQL now, by joining a single table to itself and offsetting the day in the join condition.

with daily_activity as (
  select distinct
    date_trunc('day', timestamp) as day,
    user_id
  from tracks
  where user_id is not null
),

retained as (
select
  today.day,
  count(distinct today.user_id) as retained
from daily_activity today
join daily_activity yesterday
  on today.user_id = yesterday.user_id
  and today.day = yesterday.day + interval 1 DAY
group by today.day
),

churned as (
select
  yesterday.day + interval 1 DAY as day,
  count(distinct yesterday.user_id) as churned
from daily_activity yesterday
left join daily_activity today
  on today.user_id = yesterday.user_id
  and today.day =  yesterday.day + interval 1 DAY,
where today.user_id is null
group by 1
)

select 
day,
coalesce(retained, 0) as retained,
coalesce(churned, 0) as churned

from retained
full join churned using (day)
order by 1

This example was taken with great inspiration from the Sisense blog, so feel free to give it a read to really understand what’s going on. Don’t sweat if this one makes your head hurt, the goal here is really to show you how much you can do with just a couple events.

I hope this was a useful foray into the depths you can go to with event streams. The world only gets more complicated from here as you try to do things like tie ad spend to revenue by connecting Salesforce Accounts to Product Signups through intermediary tables. Yuck! Let’s pretend we never spoke of such things.

Thank you for reading Data Based! If you enjoyed this post, it would mean a lot if you shared it!

Share


Did you enjoy this post? Do you have ideas for future metrics to cover? Maybe you think Cohort Analysis is something you’ve always wanted to learn more, or you think there’s nothing hotter than a well-defined activation metrics. Well, leave a comment or drop me an email!

1
Share this post

Counting Things: Counting Users Part 2

pedram.substack.com
Previous
Next
1 Comment
Rafael Barbosa
Aug 15, 2022

A cookie may have multiple users associated with it. It seems to be a common choice[0] to use the "last_value". But why? Wouldn't it be better to use the first user ID, that way you guarantee that your mapping/stitching table never changes. As a consequence, downstream metrics will not jump around if a 2nd, 3rd, ... user gets associated with the same cookie.

[0] Snowplow does the same: https://github.com/dbt-labs/snowplow/blob/main/models/identification/default/snowplow_id_map.sql

Expand full comment
Reply
TopNewCommunity

No posts

Ready for more?

© 2023 Pedram Navid
Privacy ∙ Terms ∙ Collection notice
Start WritingGet the app
Substack is the home for great writing