
So I have 3 tables: Users, Purchases A, and Purchases B. I want to find all users who have made any purchase within the first 30 days of joining, that is, all users with a purchase in either Purchases A or Purchases B within the first 30 days after created_at.

A few things to note:

  • The Users table has 200+ million rows
  • Purchases A has billions of rows
  • Purchases B has billions of rows

What is the best way to approach this with tables this large? I know I can accomplish this with 2 left joins, but I don't need all of the rows from the Purchases tables; I just need to know if any exist.

SELECT
  u.user_id
  ,u.name
  ,u.created_at
FROM
  users u
LEFT JOIN purchases_a pa
ON u.user_id = pa.user_id AND TO_DATE(pa.purchase_date) <= DATEADD(DAY, 30, u.created_at)
LEFT JOIN purchases_b pb
ON u.user_id = pb.user_id AND TO_DATE(pb.purchase_date) <= DATEADD(DAY, 30, u.created_at)

Is a join even the right approach here or would IN or EXISTS be more appropriate?
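
For reference, this is roughly the EXISTS variant I have in mind (same tables and columns as above, untested):

SELECT
  u.user_id
  ,u.name
  ,u.created_at
FROM users u
WHERE EXISTS (
  SELECT 1 FROM purchases_a pa
  WHERE pa.user_id = u.user_id
    AND TO_DATE(pa.purchase_date) <= DATEADD(DAY, 30, u.created_at)
)
OR EXISTS (
  SELECT 1 FROM purchases_b pb
  WHERE pb.user_id = u.user_id
    AND TO_DATE(pb.purchase_date) <= DATEADD(DAY, 30, u.created_at)
);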


country_dev
  • An `INNER` join _could_ be more performant than a `LEFT JOIN` given your data. Refer to the discussion (and upvoted answer) here: https://stackoverflow.com/questions/2726657/inner-join-vs-left-join-performance-in-sql-server. While that's for MSSQL, I believe the same concept applies (albeit, it's up to the engine). It could be worth switching to `INNER` and seeing if it's more performant with your dataset. – Edward Radcliffe Apr 21 '22 at 20:50
  • But I need to find users with purchases in ANY of the purchases tables. An inner join would only find users with a purchase in ALL of the purchases tables. – country_dev Apr 21 '22 at 21:23
  • Very true. I somehow read it needs to be in _both_. @Bill Karwin has a better answer anyway! – Edward Radcliffe Apr 21 '22 at 21:59

1 Answer


Queries against data sets this large are bound to take a long time, no matter how you try to optimize them. You are doing a table scan of 200+ million users! That alone would be costly, to say nothing of the joins.

The bottom line is that this is not a query to run against a normalized database of that scale, using a database technology optimized for OLTP work.

Maybe it could be done on a data warehouse, but not on MySQL. Even on a data warehouse, you should transform the data to support that specific report.

One approach you could take (on MySQL or on a DW) would be to denormalize slightly: add a boolean column to the users table that is true for each user who has made a purchase within 30 days of their creation date. Then you can index that boolean column, and your query can give you the result without examining 200 million rows.
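
A minimal sketch of that change, assuming MySQL and a hypothetical flag column named purchased_in_first_30_days:

ALTER TABLE users
  ADD COLUMN purchased_in_first_30_days BOOLEAN NOT NULL DEFAULT FALSE;

-- Index the flag so the report touches only matching rows
CREATE INDEX idx_users_purchased_30d
  ON users (purchased_in_first_30_days);

-- The report now reads the indexed flag instead of scanning 200M+ users
-- and joining against billions of purchase rows
SELECT user_id, name, created_at
FROM users
WHERE purchased_in_first_30_days = TRUE;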

When a user makes a purchase, your code would update the boolean column to true if it's within 30 days of the creation date.
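
A sketch of that write-path update, again assuming MySQL and the same hypothetical column (it could also be folded into the purchase-insert transaction):

-- Run when a purchase is recorded; ? is the purchasing user's id.
-- The flag is only set if the purchase lands within 30 days of signup.
UPDATE users
SET purchased_in_first_30_days = TRUE
WHERE user_id = ?
  AND created_at >= NOW() - INTERVAL 30 DAY;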

You should also be thinking about sharding, so not all users and their purchases are stored in a single database. MySQL can handle billions of rows, but it's hard to optimize queries at that scale. Splitting the data over multiple MySQL servers will become necessary soon anyway. If you do that, then each server will only need to report on its own subset of the data, so you can run your report in parallel threads, and then combine the results.

Bill Karwin