
I receive data in about 10,000 new XML files every day.

Every day I have to run a query to see whether those XML files contain any new data, and if that data doesn't exist in our database yet, insert it into our table.

Here is the code

if (!Dictionary::where('word_code', '=', $dic->word_code)->exists()) {
    // then insert into the database.
}

where $dic->word_code comes from those XML files. The script opens each XML file one by one, checks whether the record already exists, inserts it if it doesn't, then moves on to the next file and repeats the same procedure for all 10,000 XML files.

Each XML file is about 40 to 80 MB and contains a lot of data.

I already have 2,981,293 rows so far, and checking my XML files against those 2,981,293 rows before inserting is a really time-consuming and resource-hungry task.

word_code is already indexed.

The current method takes about 8 hours to complete.

By the way, I should mention that after running this huge 8-hour procedure, it ends up adding only about 1,500 to 2,000 new rows of data per day.

Jonib

2 Answers


Comparing the files to the database record by record is the core issue. Both the filesystem and the database can compare millions of rows very quickly.

You have two options.

Option 1: Keep a backup of the previous run's files and run a filesystem compare against the new files to find the differences.

Option 2: Load the XML files into a MySQL table using LOAD DATA INFILE. Then run a query over all rows to find both the new and the changed rows. Be sure to index the table with a well-defined unique key to keep this query efficient.
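
A minimal sketch of option 2, assuming the XML records have already been flattened into a tab-separated file; the table names (dictionaries, dictionary_staging), the file path, and every column apart from word_code are illustrative assumptions:

-- Staging table shaped like the main table (all names here are assumed).
CREATE TABLE IF NOT EXISTS dictionary_staging LIKE dictionaries;
TRUNCATE TABLE dictionary_staging;

-- Bulk-load the flattened XML data; one LOAD is far cheaper than
-- thousands of single-row INSERTs.
LOAD DATA INFILE '/var/lib/mysql-files/words.tsv'
INTO TABLE dictionary_staging
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(word_code, word, definition);

-- New rows are those whose word_code is not yet in the main table.
SELECT s.*
FROM   dictionary_staging s
LEFT JOIN dictionaries d ON d.word_code = s.word_code
WHERE  d.word_code IS NULL;

Note that LOAD DATA INFILE is restricted by the server's secure_file_priv setting, so the file must sit in a directory the MySQL server is allowed to read from.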

yg-dba
  • Based on trincot's response below, you can use LOAD DATA INFILE to append directly to the result table, if the result table matches the format of the XML. However, if there is an ETL process or other logic that occurs when a new row is found, this may not work. – yg-dba Jul 16 '20 at 21:52
  • These are thousands of XML files; I can't LOAD DATA INFILE them. My script has to compare those XML files every day, insert only the new data, and ignore the duplicates. – Jonib Jul 17 '20 at 11:49
  • Precisely because you have thousands of XML files, you should load them into the database using LOAD DATA INFILE. This method of loading data into MySQL is dramatically faster than doing a separate INSERT for each row. You can load the data into a staging table and then compare for differences, or INSERT IGNORE. – yg-dba Jul 20 '20 at 02:22

I would split this job into two tasks:

  1. Use your PHP script to load the XML data unconditionally into a temporary table that has no constraints, no primary key and no indexes (see the sketch after this list). Make sure to truncate that table before loading the data.

  2. Perform one single INSERT statement to merge the records from that temporary table into your main table, possibly with an ON DUPLICATE KEY UPDATE or IGNORE option, or otherwise with a negative join clause. See INSERT IGNORE vs INSERT … ON DUPLICATE KEY UPDATE
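
For the first point, a minimal sketch of such a staging table, assuming it is called temp (matching the statements below) and assuming hypothetical columns next to word_code:

-- Bare staging table: no primary key, no indexes, no constraints,
-- so the bulk load stays as cheap as possible (column types are assumed).
CREATE TABLE IF NOT EXISTS temp (
    word_code  VARCHAR(64),
    word       VARCHAR(255),
    definition TEXT
);

-- Empty it before each daily run.
TRUNCATE TABLE temp;

The PHP script then inserts every parsed XML record into temp without any existence checks; multi-row INSERTs (or LOAD DATA INFILE, as in the other answer) keep that load fast.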

For the second point, you could for instance do this:

INSERT IGNORE 
    INTO   main 
    SELECT * 
    FROM   temp;
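
If duplicate rows should update the existing record instead of being skipped (the ON DUPLICATE KEY UPDATE option mentioned above), a hedged variant could look like the following; it relies on the same unique key on word_code, and the non-key columns are assumptions:

INSERT INTO main (word_code, word, definition)
    SELECT word_code, word, definition
    FROM   temp
    ON DUPLICATE KEY UPDATE
        word       = VALUES(word),
        definition = VALUES(definition);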

If the field to compare is not a primary key in the main table, or is not uniquely indexed, then you might need a statement like this:

INSERT INTO   main   
    SELECT    temp.*
    FROM      temp 
    LEFT JOIN main m2 
           ON m2.word_code = temp.word_code
    WHERE     m2.word_code is NULL;

But this will be a bit slower than a primary-key based solution.

trincot
  • That makes more sense; I am going to try this and see what I get. – Jonib Jul 17 '20 at 08:28
  • Any example of merging from the temp table into the main table that updates duplicate records but inserts new records? – Jonib Jul 17 '20 at 16:47
  • See this [fiddle](https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=86fe0739fbdefb1f44417219915915ea) – trincot Jul 17 '20 at 17:16
  • Thanks for the help. I see you are matching the records through the ID column, so the ID has to be the same in both tables. – Jonib Jul 17 '20 at 18:54
  • It is just an example. In your case you would probably use `word_code`. Use whatever identifies that two records are actually the same thing, and therefore the loaded record should be ignored. If it is not the primary key, then create a unique index for it on your main table, if not already the case. – trincot Jul 17 '20 at 18:57
  • If you compare with a non-primary key, you could also use a negative join, which will be slower. Like [here](https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=6810c627a7ca1d1d8e225664b58c1bfc) – trincot Jul 17 '20 at 19:06
  • Both of your solutions are working fine for me. I just had to add a unique index to the main table plus an auto-increment column to my temporary table. I like the first method, but I am still not sure how it compares the temp table with the main table with only the unique index on the main table and without manually defining a WHERE condition. Still, you have already reduced my code by 50%. Cheers. – Jonib Jul 17 '20 at 19:32
  • That first method *intends* to just add each and every record from `temp` into `main`. But as the unique index will sometimes spot a violation of uniqueness, the `ignore` keyword suppresses the error message that would normally be spit out, and the statement happily continues inserting the other records that do not conflict with the index. – trincot Jul 17 '20 at 19:37
  • This is the best answer I could ever accept. Perfect! You have sent me back to school. – Jonib Jul 17 '20 at 19:43
  • Just want to let you know that with your method I got rid of the nested query loops, and the 8 hours are reduced to only 1 hour. – Jonib Jul 18 '20 at 13:58
  • Thank you for the feedback! Good to hear that it was useful. – trincot Jul 18 '20 at 14:02