List of user sessions in SQL

| category: Programming | author: st
Tags:

The frequently asked problem on interviews.

You have a log of some user activities represented as a table. Every activity record has at least users ID and activity date/time values.

The session is a sequence of activities having less than N minutes between two log records. When the elapsed time is greater than N minutes, the session is terminated; the new one is started.

The question is how to calculate the number of session per user.

Solution

The commonly proposed solution is based on using the LAG() window function which may not be supported by some DBMS. Therefore, the "old school pure SQL" solution still seems to be simple and is running everywhere.

Suppose the simplified log table user_activities which has only two required columns: user_id and activity_dt.

CREATE TABLE user_activities (
    user_id integer NOT NULL,
    activity_dt datetime NOT NULL
);

Insert the test data; I've also marked session breaks in the comments.

INSERT INTO user_activities (user_id, activity_dt)
VALUES
    -- user 1
    (1, '2021-03-05 10:05:00'),
    (1, '2021-03-05 10:10:00'), -- break 1
    (1, '2021-03-05 10:20:00'), -- 2
    (1, '2021-03-05 20:05:00'),
    (1, '2021-03-05 20:05:00'),
    (1, '2021-03-05 20:10:00'),
    (1, '2021-03-05 20:15:00'),
    (1, '2021-03-05 20:20:00'),
    (1, '2021-03-05 20:25:00'),
    (1, '2021-03-05 20:30:00'), -- 3
    (1, '2021-03-05 20:40:00'),
    -- user 2
    (2, '2021-03-05 10:05:00'),
    (2, '2021-03-05 10:08:00'),
    (2, '2021-03-05 10:10:00'), -- 1
    (2, '2021-03-05 10:12:00'),
    (2, '2021-03-05 10:20:00'), -- 2
    (2, '2021-03-05 20:10:00'),
    (2, '2021-03-05 20:15:00'),
    (2, '2021-03-05 20:20:00'),
    (2, '2021-03-05 20:25:00'),
    (2, '2021-03-05 20:30:00'), -- 3
    (2, '2021-03-05 20:40:00'), -- 4
    (2, '2021-03-05 20:46:00')
;

Now we can get the session count using CTE to simplify the query only. Note that I use the SQL Server specific function DATEDIFF which should be replaced by counterparts for other DBMS. E.g. TIMESTAMPDIFF() for MySQL, EXTRACT(EPOCH FROM... for PostgreSQL and so on.

Suppose also that the maximum delay breaking a session is 5 minutes.

WITH periods AS (
    SELECT
        ua1.user_id,
        ua1.activity_dt AS dt1,
        (
            SELECT MIN(ua2.activity_dt)
            FROM user_activities ua2
            WHERE ua2.user_id = ua1.user_id AND ua2.activity_dt > ua1.activity_dt
        ) AS dt2
    FROM user_activities ua1
)
SELECT user_id, COUNT(1) AS session_count
FROM periods
WHERE DATEDIFF(minute, dt1, dt2) > 5 -- max delay
GROUP BY user_id
;

Result

user_id     session_count
----------- -------------
1           3
2           4

(2 rows affected)