Looking for efficient solution to join two tables but with the caveat that characteristics of second table should determine what is joined to first table (in Google BigQuery).
Lets say I have two tables. One Table with events (id, session, event_date) and a second with policies applying to events (event_id, policy, create_date) and I want to determine which policy applied to an event based on the policy create date and the event date.
CREATE TEMP TABLE events AS (
SELECT *
FROM UNNEST([
STRUCT(1 AS id, "A" AS session, "2021-11-05" AS event_date),
(1, "B", "2021-12-17"),
(2, "A", "2021-08-13")
])
);
CREATE TEMP TABLE policies AS (
SELECT *
FROM UNNEST([
STRUCT(1 AS event_id, "foo" AS policy, "2021-01-01" AS create_date),
(1, "bar", "2021-12-01"),
(2, "foo", "2021-02-01")
])
)
In my example, the result should look like this if I get the latest policy_create_date that was in existence by the time of the event (enevt_date
).
id | session | policy_create_date |
---|---|---|
1 | A | 2021-01-01 |
1 | B | 2021-12-01 |
2 | A | 2021-02-01 |
The following solution would provide the result I want, but it create a N:N JOIN and can become quite big and calculation intense, if both tables get large (especially if I have many of the same events and many policy changes). Hence, I'm looking for a solution that is more efficient than the solution below and avoids the N:N JOIN.
SELECT
e.id,
e.session,
MAX(p.create_date) AS policy_create_date -- get latest policy amongst all policies for an event_id that existed before the session took place
FROM events e
INNER JOIN policies p
ON e.id = p.event_id -- match event and policy based on event_id
AND p.create_date < e.event_date -- match only policies that existed before the session of the event took place
GROUP BY 1, 2
TY!!!
Edit: I adjusted the known but inefficient solution to better reflect my goal. Of course, I want the policy in the end, but that is not in focus here.