In mysql Distrib 5.5.49, the goal here is to generate Table C efficiently from the input, Table 0. Table 0 needs some calculation to generate Table A, and it will be joined by Table B to generate Table C.
However, the whole process is a bit slow for the big join. Therefore, we are trying to "recycle" data (copy all the columns except the ID column) when the data contents with different IDs are the same, in the hope that the new approach will be more efficient. However, in the following test set and pseduo code, we found the old approach is actually about 50% faster than the new, recycling data approach.
Old approach (one INNER JOIN on bigger tables):
Input:
Table 0, which contains 2,000 rows. Some calculation processes will
generate 1 million rows Table A from Table 0.
Then
1 million rows Table A
INNER JOIN
200K rows Table B
On ID column, which are indexed in both Table A & Table B
=> generate 2 million rows Table C
------------------------------
New approach: (separate inputs , INNER JOIN on smaller tables,
then UNION and RIGHT JOIN on smaller tables to get bigger tables)
Input: Separate Table 0 into
Table 1, 1,000 rows. Some calculation processes will generate
1 million rows Table A from Table 0.
Table 2: 2,000 rows
Create Table C AS (
0.5 million rows Table A
INNER JOIN
200K rows Table B
On ID column, which are indexed in both Table A & Table B)
=> generate 0.6 million rows Table C;
Create Table D AS (
0.6 million rows Table C
UNION
1000 rows Table 2
RIGHT JOIN 0.6 million rows Table C
On ID column, which are indexed in both Table A & Table B
=> generate 2 million rows Table D);
DROP TABLE 0.6 million rows C ();
ALTER TABLE TABLE D RENAME TO TABLE C;
The SQL fiddle of the new approach is available here (http://sqlfiddle.com/#!9/91c1cf/5)
Could any guru kindly offer some comments on how to optimize the new approach?
I am assuming the new approach might be faster when the input file (Table 0) has more replicating data where we can separate and recycle them. However, when we test with a bigger input, the system constantly failed with an error message:
Incorrect key file for table '/tmp/#sql_3c51_0.MYI'; try to repair it
According to this post, this error message means the system doesn't have enough /tmp folder for the temporary tables.
Incorrect key file for table '/tmp/#sql_3c51_0.MYI'; try to repair it
I am still trying to following the instruction in this post. Any 2 cents might be helpful!