0

Looking for efficient solution to join two tables but with the caveat that characteristics of second table should determine what is joined to first table (in Google BigQuery).

Lets say I have two tables. One Table with events (id, session, event_date) and a second with policies applying to events (event_id, policy, create_date) and I want to determine which policy applied to an event based on the policy create date and the event date.

CREATE TEMP TABLE events AS (
    SELECT *
    FROM UNNEST([
        STRUCT(1 AS id, "A" AS session, "2021-11-05" AS event_date),
        (1, "B", "2021-12-17"),
        (2, "A", "2021-08-13")
    ])
);
CREATE TEMP TABLE policies AS (
SELECT *
    FROM UNNEST([
        STRUCT(1 AS event_id, "foo" AS policy, "2021-01-01" AS create_date),
        (1, "bar", "2021-12-01"),
        (2, "foo", "2021-02-01")
    ])
)

In my example, the result should look like this if I get the latest policy_create_date that was in existence by the time of the event (enevt_date).

id session policy_create_date
1 A 2021-01-01
1 B 2021-12-01
2 A 2021-02-01

The following solution would provide the result I want, but it create a N:N JOIN and can become quite big and calculation intense, if both tables get large (especially if I have many of the same events and many policy changes). Hence, I'm looking for a solution that is more efficient than the solution below and avoids the N:N JOIN.

SELECT
  e.id,
  e.session, 
  MAX(p.create_date) AS policy_create_date -- get latest policy amongst all policies for an event_id that existed before the session took place
FROM events e
INNER JOIN policies p
  ON e.id = p.event_id -- match event and policy based on event_id
  AND p.create_date < e.event_date -- match only policies that existed before the session of the event took place
GROUP BY 1, 2

TY!!!

Edit: I adjusted the known but inefficient solution to better reflect my goal. Of course, I want the policy in the end, but that is not in focus here.

Racooneer
  • 329
  • 1
  • 2
  • 11
  • Do you really want the full result of your table or will you be limiting the query? As a rule I would avoid having a CTE with a much larger result than the output. Dropping the CTE and using a lateral join with something like (SELECT * FROM policies as p WHERE ... ORDER BY p.create_date DESC LIMIT 1) could work. – Hinni Mar 13 '22 at 16:47
  • Please explain the condition for the join. –  Mar 13 '22 at 20:29
  • @Hinni sorry for the confusion. I edited the post earlier and in result, it had confusing parts. | I would indeed filter the part of the "events" prior to the join in my example. However, the condition of the real data omits to filter the policies – Racooneer Mar 14 '22 at 05:49
  • @Kendle I added some comments to the query. Hope it makes it more clear. Thanks for considering – Racooneer Mar 14 '22 at 08:08
  • @Kendle could you elaborate your point please? – Racooneer Mar 15 '22 at 15:16
  • I've checked and the indexes question does not apply to Google BigQuery. I've deleted my comment. Sorry about the red herring –  Mar 15 '22 at 15:20
  • Have you evidenced that your query really is an issue? SQL isn't an imperative language, it's declarative; you may find that the optimiser isn't doing what you think it is. You should profile this with realistic data for stress testing, and only when an issue materialises should you optimise it (anything else is by definition premature optimisation). If you have profiled this with representative data, then you should include the profiling statistics in your question, at the very least the full explain plan. – MatBailie Mar 26 '22 at 14:11

4 Answers4

1

You can try the window function

WITH cte AS (
  SELECT e.id, e.session, p.policy
   , row_number() over(partition by e.id, e.session order by p.create_date desc) rn
  FROM events e
  INNER JOIN policies p
  ON e.id = p.event_id AND p.create_date < e.event_date
)
SELECT c.id, c.session, c.policy
FROM cte c
where rn=1
Serg
  • 22,285
  • 5
  • 21
  • 48
  • Hi, thanks for the suggestion. That would avoid the join after the CTE. However, I'm trying to avoid the N:N Join in the CTE. I'll adjust my question. – Racooneer Mar 13 '22 at 16:20
1

I have tried the following code on Postgres, but there shouldn't be anything in there that is postgres specific.

Your query can be reorganised using a subquery to:

SELECT
  e.id,
  e.session, 
  (SELECT MAX(create_date) FROM policies AS p WHERE e.id = p.event_id AND p.create_date < e.event_date) AS policy_create_date
FROM events e
WHERE policy_create_date IS NOT NULL

While this query should show similar performance it makes it easier to spot the problem with the overall query: While finding the MAX the database has already found and read the row from policies with the highest date, but you are not getting the the value of the policy column out. So, you need to do a second join.

Using a lateral join you can get the complete relevant row from policies in one go.

SELECT
  e.id,
  e.session,
  p2.policy,
  p2.create_date
FROM events AS e
INNER JOIN LATERAL
   (SELECT
     * 
    FROM policies AS p 
    WHERE e.id = p.event_id AND p.create_date < e.event_date 
    ORDER BY p.create_date DESC
    LIMIT 1) AS p2 
ON TRUE;

This should use an index on policies. So, time should increase linear with size of events and logarithmic with size of policies.

Nevertheles, you can't expect great performance when you do this for large resultsets, because there will be lots of cache-misses while accessing the policies table.

Hinni
  • 166
  • 5
  • BigQuery doesn't support LATERAL or APPLY. This question appears to address a workaround; https://stackoverflow.com/questions/66244399/outer-apply-in-bigquery – MatBailie Mar 26 '22 at 14:21
  • It is actually described in the official google documentation: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax?hl=de#correlated_join – Hinni Mar 27 '22 at 19:17
0

Another option is to interleave the two tables, then use LAST_VALUE() to look back to find the policy data...

WITH
  interleave AS
(
  SELECT
    id          AS event_id,
    event_date  AS event_date,
    session     AS event_session,
    NULL        AS policy_label,
    NULL        AS policy_date
  FROM
    events

  UNION ALL

  SELECT
    event_id,
    create_date,
    NULL,
    policy,
    create_date
  FROM
    policies
),
  lookback AS
(
  SELECT
    event_id,
    event_session,
    event_date,
    LAST_VALUE(policy_label IGNORE NULLS) OVER event_order   AS policy_label,
    LAST_VALUE(policy_date  IGNORE NULLS) OVER event_order   AS policy_date
  FROM
    interleave
  WINDOW
    event_order AS (
      PARTITION BY event_id
          ORDER BY event_date,
                   event_session NULLS FIRST
      ROWS BETWEEN UNBOUNDED PRECEDING
               AND         1 PRECEDING
    )
)
SELECT
  event_id,
  event_session,
  event_date,
  policy_label,
  policy_date
FROM
  lookback
WHERE
  event_session IS NOT NULL

This presumes that the events table is vastly larger than the policies table.

I'd also recommend ensuring the tables are partitioned by the event_id and clustered by their respective date column.

MatBailie
  • 83,401
  • 18
  • 103
  • 137
0

Another option is to use LEAD() to find a policy's "expiry" date, then use that in the join...

WITH
  policy_range AS
(
  SELECT
    event_id,
    policy,
    create_date,
    LEAD(create_date, 1, DATE '9999-12-31') OVER event_order  AS expiry_date
  FROM
    policies
  WINDOW
    event_order AS (
      PARTITION BY event_id
          ORDER BY create_date
    )
)
SELECT
  e.id,
  e.session,
  e.event_date,
  p.policy,
  p.create_date
FROM
  policy_range  AS p
INNER JOIN
  events        AS e
    ON  e.id          = p.event_id
    AND e.event_date >  p.create_date
    AND e.event_date <= p.expiry_date
MatBailie
  • 83,401
  • 18
  • 103
  • 137