
I use XMLStreamer to read the XML file (structure here).

One XML file contains many products (about 100,000 per file). The products have parameters; in total there are up to 500,000 of them per file. There will be around 20 XML files (maybe more, each with different products). Will a database be a good place to store such data?

Main problem:

The problem is the execution time of the script that inserts the data into the database. Currently, with around 35,000 products, it takes about 300 s. How can I optimize it?

Code:

try {
    $start = time();

    $mysqli->begin_transaction();
    $stmtProducts = $mysqli->prepare("INSERT INTO products (shop_id, product_id, product_name) VALUES (?, ?, ?)");
    $stmtProducts->bind_param("iis", $shopId, $productId, $productName);

    $streamer = new XMLStreamer('offer', 'products', 'product');
    foreach ($streamer->stream('bds.xml') as $product) {
        $document = new \DOMDocument();
        $document->appendChild($product);
        $element = simplexml_import_dom($product);
        $productId = (int) $element->attributes()->id;
        // bind_param already sends the value safely; escaping it again
        // with real_escape_string would corrupt the stored data
        $productName = (string) $element->description->name;
        $stmtProducts->execute();
    }

    $stmtProducts->close();
    $mysqli->commit();
    var_dump("Time [s]: " . (time() - $start));
} catch (mysqli_sql_exception $exception) {
    $mysqli->rollback();
    throw $exception;
}
nxx
  • Have you tried looking at prepared statements for your insert operations? That way, the database does not have to parse the same command over and over again. I also think there are commands that delay the rebuilding of indices until you are done. – Refugnic Eternium Aug 19 '22 at 12:51
  • Please consider using prepared statements either way. Even if they won't improve performance, they will make the code cleaner with less chances of SQL injection. – Dharman Aug 19 '22 at 13:16
  • How often are you running this? 300s once a month or even once a day isn't that big of a deal. – Chris Haas Aug 19 '22 at 13:21
  • Instead of executing each statement individually, you can try a `multiple-row insert` to speed things up. It can be one insert per table, or one per 10,000 records. Or use a prepared statement. Or use LOAD DATA to speed up the insert. Please check this - https://dev.mysql.com/doc/refman/8.0/en/insert-optimization.html – Ananth MK Aug 19 '22 at 13:40
  • Using a `prepared statement` I reduced the time by about a minute. Thank you for your valuable advice! @ChrisHaas - it runs once a day. – nxx Aug 19 '22 at 15:44
  • Please provide `SHOW CREATE TABLE` for the tables. The diagram you provided leaves out important info. – Rick James Aug 19 '22 at 16:29
  • 20% improvement using prepared statements -- thanks for the datapoint. – Rick James Aug 19 '22 at 16:37
  • You could also consider importing the XML lines one by one, or in batches, into a staging table and running a stored procedure with [XML functions](https://dev.mysql.com/doc/refman/5.7/en/xml-functions.html) to parse them and insert into the correct tables. Plus, if you SHA-512 the lines, you can import/update only the changes between the previous day and the current day, assuming you also have to update the information. – Raymond Nijland Aug 20 '22 at 13:42

1 Answer


Batch insert.

Instead of inserting one row at a time, insert 100 rows in a single INSERT. This may shrink the 300 seconds to 30.
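A minimal sketch of how such a batch insert could look with mysqli. `buildBatchInsert()` and `flushBatch()` are hypothetical helpers (not from the question's code); the table and column names are taken from the question, and the batch size of 100 follows the suggestion above.

```php
<?php
// Hypothetical helper: build a multi-row INSERT with one (?, ?, ...) group per row.
function buildBatchInsert(string $table, array $columns, int $rowCount): string
{
    $group = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    return sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $table,
        implode(', ', $columns),
        implode(', ', array_fill(0, $rowCount, $group))
    );
}

// Hypothetical helper: insert a batch of [shop_id, product_id, product_name]
// rows in a single statement.
function flushBatch(mysqli $mysqli, array $rows): void
{
    if ($rows === []) {
        return;
    }
    $sql  = buildBatchInsert('products', ['shop_id', 'product_id', 'product_name'], count($rows));
    $stmt = $mysqli->prepare($sql);
    // "iis" repeated once per row; the row arrays are flattened and bound in order.
    $stmt->bind_param(str_repeat('iis', count($rows)), ...array_merge(...$rows));
    $stmt->execute();
    $stmt->close();
}
```

In the streaming loop you would push `[$shopId, $productId, $productName]` onto an array, call `flushBatch()` every 100 rows, and call it once more after the loop for the remainder. Because the last batch is usually smaller, the statement is re-prepared per batch here; if that overhead matters, prepare one fixed-size statement up front and handle the final partial batch separately.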

Re-think the PRIMARY KEY of the tables. To use IODKU (INSERT ... ON DUPLICATE KEY UPDATE), you specify the set of unique columns in the INSERT part but leave them out of the UPDATE part. I would guess that some combination of the columns is really the PK.
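For illustration only, here is how such an IODKU could look for the question's products table. The assumption that (shop_id, product_id) is the real unique key is mine, not confirmed by the question; check it against your actual schema.

```php
<?php
// Assumed unique key: (shop_id, product_id) -- an assumption for this sketch.
// The key columns appear in the INSERT column list but not in the UPDATE part;
// only the non-key column is refreshed when the row already exists.
$sql = "INSERT INTO products (shop_id, product_id, product_name)
        VALUES (?, ?, ?)
        ON DUPLICATE KEY UPDATE product_name = VALUES(product_name)";
```

On MySQL 8.0.20+ the `VALUES()` function in the UPDATE clause is deprecated in favor of a row alias (`... VALUES (?, ?, ?) AS new ON DUPLICATE KEY UPDATE product_name = new.product_name`), but both forms work.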

More

It is quite strange to have

id
xx_id
xx_name

If xx_id is the id for xx_name, then you don't need id. If you also have

foo_id

then why have xx_id?

There are two main cases:

A simple mapping between a long string and a short "id" representing the string. There may also be other info:

country_code CHAR(2),
country_name VARCHAR(100),
capital ...

In this case, country_code is used in other tables, and a JOIN is used to get the name for display purposes. Or the name is used to get the matching code.

Another case:

company_id SMALLINT
branch_id TINYINT
number_of_employees SMALLINT

In this case, a composite PK seems useful: PRIMARY KEY(company_id, branch_id)

Rick James
  • I added table structures to the first post and updated the PHP code; it is now using `prepared statements`. The `PK` is a separate column, because e.g. product_id, parameter_id, value_id can repeat (there are different products for each shop), so I made a separate column for row identification. Is the database structure correct? Is it possible to enter 100 rows in 1 query using a prepared statement? – nxx Aug 20 '22 at 13:43
  • I don't know of a good PHP way to use "prepare" for that. I wish there were. I added some other stuff to my Answer; you seem to have some misconceptions about ids. – Rick James Aug 20 '22 at 14:29