0

I am trying to join table df1 with df2. The problem is that df2 is a very long table that have 40 million observations and I can't get the joined outcome in MySQL after waiting for more than 48 hours. So I want to ask if there is any way to improve the efficiency of this join. I already added an index on Tag column for both tables.

df1 and df2 have same structure and here is an example:

|Id        |Tag
| -------- | --------------------------------------------
|1         |c#
|1         |winforms
|1         |type-conversion
|1         |decimal
|2         |.net
|2         |decimal
|3         |.net
|3         |math

Both tables have two columns, Id and Tag. However, neither Id or Tag are uniquely identified keys. Only Id+Tag can be a uniquely identified key. What I need is to left out join df1 with df2 on column Tag. And here is my code:

CREATE TABLE matched_outcome AS
SELECT df1.Id AS df1_Id, df2.Id AS df2_Id, COUNT(df2.Tag) AS overlapping
FROM df1
LEFT JOIN df2 ON df2.Tag=df1.Tag
GROUP BY df1.Id, df2.Id;
Xiaomeng
  • 43
  • 8
  • 1
    Add indexes on those columns. That will help. Before you do, run EXPLAIN PLAN and look for a table scan. If you see one, you'll know adding the index will help. – duffymo Jul 25 '17 at 00:47
  • @duffymo Sorry I forgot to mention before, I already added an index on Tag column for both tables. – Xiaomeng Jul 25 '17 at 00:58
  • 40 million rows? You're screwed. You need to filter on that big table to cut down the size of the JOIN. – duffymo Jul 25 '17 at 01:09

2 Answers2

1

I would try composite index on (tag, id) for both tables in this order.

Check using execution plan if this index gets picked up.

Partitioning table on tag column might help, as partition internally creates multiple smaller tables.

Also you should run this and check where your process gets stuck SHOW FULL PROCESSLIST

That might give you further leads.

Vijay Agrawal
  • 1,643
  • 12
  • 17
0

"inflate-deflate syndrome" happens when you do JOIN...GROUP BY. And it usually lead to wrong (high) values for aggregates (COUNT, SUM, etc).

I am too baffled about what the query is supposed to do to help you rewrite it. Can you elaborate? Why LEFT? What is "overlapping"?

Sometimes the workaround is to do something like:

SELECT df1.some_stuff,
       ( SELECT COUNT(*) FROM df2 WHERE Tag = df1.Tag ) AS overlapping
    FROM df1;

Does this even come close to what you want?

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • what I am actually trying to do is this: https://stackoverflow.com/questions/45024037/how-to-merge-by-a-column-of-collection-using-python-pandas. Initially I was trying to use Python pandas proposed by the accepted answer. However, the last step (match df1 with df2) was problematic. So I decided to use MySQL to do the last step. – Xiaomeng Jul 25 '17 at 13:10
  • `` Another case of 3rd party software getting in the way? `` – Rick James Jul 25 '17 at 17:23