
I have a PHP script which iterates through a JSON file line by line (using JsonMachine) and checks each line against some criteria in a foreach loop; if the criteria are met, it checks whether the row is already in a database and then imports or updates it (MySQL 8.0.26). As an example, the last time this script ran it iterated through 65,000 rows and imported 54,000 of them in 24 seconds.

Each JSON row has a UUID as its unique key, which I was importing as a VARCHAR(36). I read that it can be advantageous to store UUIDs as BINARY(16) using UUID_TO_BIN() and BIN_TO_UUID(), so I changed the database fields to BINARY(16), coded the import script to store the UUID as binary, and the read scripts to decode it back to the text form.
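For reference, a minimal sketch of the BINARY(16) setup being described; the table and column names here are made up for illustration, and the sample UUID is the one quoted later in the comments:

```sql
-- Hypothetical table: the UUID is stored as 16 raw bytes.
CREATE TABLE items (
  id BINARY(16) NOT NULL PRIMARY KEY,
  payload JSON
);

-- Import side: convert the 36-char text UUID to binary.
INSERT INTO items (id, payload)
VALUES (UUID_TO_BIN('0000579f-7b35-4ed3-b44c-db2a538066fe'), '{}');

-- Read side: convert back to the text form.
SELECT BIN_TO_UUID(id) AS id, payload FROM items;
```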

This worked functionally, but the import time went from 24 seconds to 30 minutes. The server was not CPU-bound during that time, running at 25 to 30% (normally under 5%).

Without the UUID conversion the script runs at about 3,000 lines per second; with the conversion it runs at about 30 lines per second.

The question: can anyone with experience of bulk importing using UUID_TO_BIN() comment on performance?

I've reverted to native UUID storage, but I'm interested to hear others' experience.

EDIT with extra info from comments and replies:

  • The UUID is the primary key
  • The server is a VM with 12GB RAM and 4 assigned cores
  • The table is 54,000 rows (from the import) and is 70MB in size
  • innodb_buffer_pool_size is unchanged from the default of 128MB (134,217,728 bytes)
simonw
  • show your code, show your schema please. They are the witness of what you are trying to describe. in my experience, 30 mins for 54K records is an eternity. – YvesLeBorg Jan 24 '22 at 23:09
  • Is this [key a PK](https://stackoverflow.com/a/7578500/231316) or just a regular key? – Chris Haas Jan 24 '22 at 23:26

1 Answer


Oh, bother. UUID_TO_BIN changed the UUID values from being scattered to being roughly chronologically ordered (for type 1 UUIDs). This helps performance by clustering rows better on disk.
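For type-1 UUIDs that reordering comes from UUID_TO_BIN()'s optional swap flag, which moves the time-high and time-mid groups in front of time-low. A sketch, using the UUID quoted in the comments below:

```sql
-- Without the swap flag the bytes keep their original order;
-- with it, the third and second groups move to the front.
SELECT HEX(UUID_TO_BIN('0000579f-7b35-4ed3-b44c-db2a538066fe'));
-- → 0000579F7B354ED3B44CDB2A538066FE
SELECT HEX(UUID_TO_BIN('0000579f-7b35-4ed3-b44c-db2a538066fe', 1));
-- → 4ED37B350000579FB44CDB2A538066FE
```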

First, let's check the type. Please display one (any one) of the 36-char UUIDs, or HEX() of the 16-byte binary version. After that, I will continue this answer depending on whether it is type 1 or some other type.

Meanwhile, some other questions (to help me focus on the root cause):

  • What is the value of innodb_buffer_pool_size?
  • How much RAM?
  • How big is the table?
  • Were the incoming uuids in some particular order?

A tip: use IODKU (INSERT ... ON DUPLICATE KEY UPDATE) instead of SELECT + (UPDATE or INSERT). That should roughly double the speed.
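A sketch of IODKU, with illustrative table and column names; the row-alias syntax shown requires MySQL 8.0.19+ (which the question's 8.0.26 satisfies):

```sql
-- One round trip replaces SELECT + (UPDATE or INSERT).
INSERT INTO items (id, name, qty)
VALUES (UUID_TO_BIN(?), ?, ?) AS new
ON DUPLICATE KEY UPDATE
  name = new.name,
  qty  = new.qty;
```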

Then batch them 100 at a time. That may give another 10x speedup.
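Batching might look like this (two rows shown for brevity; the importer would build 100 value tuples per statement -- names again illustrative):

```sql
-- Multi-row IODKU: one statement, one round trip, many rows.
INSERT INTO items (id, name) VALUES
  (UUID_TO_BIN(?), ?),
  (UUID_TO_BIN(?), ?)
  AS new
ON DUPLICATE KEY UPDATE name = new.name;
```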

More

  • Your UUIDs are type 4 -- random. UUID_TO_BIN() changes from one random order to another. (Dropping from 36 bytes to 16 is still beneficial.)
  • innodb_buffer_pool_size -- 128M is an old, too small, default. If you have more than 4GB, set that to about 70% of RAM. This change should help performance significantly. Your VM has 12GB, so change the setting to 8G. This will eliminate most of the I/O, which is the slow part of SQL.
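In MySQL 8.0 this change does not require editing the config file: SET PERSIST resizes the buffer pool online and keeps the value across restarts. The 8G figure assumes MySQL has the VM's 12GB largely to itself:

```sql
-- Resize the InnoDB buffer pool to 8GB and persist the setting.
SET PERSIST innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;
```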
Rick James
  • One of the UUIDs: 0000579f-7b35-4ed3-b44c-db2a538066fe. 12GB RAM, 4 virtual cores. Table is 54,000 rows (all imported under this process), 70MB. No, the imported UUIDs are not in any particular order – simonw Jan 26 '22 at 04:20
  • ...and 128MB default buffer size – simonw Jan 26 '22 at 04:29
  • Re IODKU instead of Select +(update or insert) - yep, good call... although half of 30 minutes is still 15 mins compared to 25 seconds... At the moment I'm leaning that the incremental performance reduction from just using native UUIDs is an acceptable trade-off. – simonw Jan 26 '22 at 05:26
  • @simonw - I added More. – Rick James Jan 26 '22 at 05:47