I am trying to join table df1 with df2. The problem is that df2 is a very long table that have 40 million observations and I can't get the joined outcome in MySQL after waiting for more than 48 hours. So I want to ask if there is any way to improve the efficiency of this join. I already added an index on Tag column for both tables.
df1 and df2 have same structure and here is an example:
|Id |Tag
| -------- | --------------------------------------------
|1 |c#
|1 |winforms
|1 |type-conversion
|1 |decimal
|2 |.net
|2 |decimal
|3 |.net
|3 |math
Both tables have two columns, Id and Tag. However, neither Id or Tag are uniquely identified keys. Only Id+Tag can be a uniquely identified key. What I need is to left out join df1 with df2 on column Tag. And here is my code:
CREATE TABLE matched_outcome AS
SELECT df1.Id AS df1_Id, df2.Id AS df2_Id, COUNT(df2.Tag) AS overlapping
FROM df1
LEFT JOIN df2 ON df2.Tag=df1.Tag
GROUP BY df1.Id, df2.Id;