
I want to compare write-intensive performance between the InnoDB and MyRocks engines of MySQL. For this purpose, I use sysbench as the benchmark. My requirements are:

  • multiple threads writing concurrently to the same table.
  • support for batch inserts (each insert transaction inserts a bulk of records; see the sketch below).

I have checked all of sysbench's pre-made tests and I don't see any that satisfy these requirements.

  • oltp_write_only: supports multiple threads writing to the same table, but it doesn't have a bulk insert option.
  • bulk_insert: supports multiple threads, but each thread writes to a different table.

Are there any pre-made sysbench tests that satisfy my requirements? If not, is there somewhere I can find custom Lua scripts that already do this?
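
To make the requirement concrete, each "batch insert" would be one transaction running a single multi-row `INSERT`, roughly like this sketch (the batch size and values are only placeholders; the `tableA` schema is shown below):

```
-- Sketch of one batch-insert transaction: a single multi-row INSERT
-- wrapped in its own transaction (values are placeholders).
START TRANSACTION;
INSERT INTO tableA (user_id, data) VALUES
    ('user-0001', '{"b": 100}'),
    ('user-0002', '{"b": 200}'),
    ('user-0003', '{"b": 300}');
COMMIT;
```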

(from Comment:)

```
CREATE TABLE IF NOT EXISTS `tableA` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `user_id` VARCHAR(63) NOT NULL DEFAULT '',
    `data` JSON NOT NULL DEFAULT '{}',
    PRIMARY KEY (`id`),
    UNIQUE INDEX `user_id_UNIQUE` (`user_id` ASC)
) ENGINE = InnoDB;
```
  • Are you `INSERTing` into InnoDB? (Or into MyRocks?) By "batch insert" do you mean a bunch of 1-row `INSERTs` in a transaction? Or single `INSERT` statements, each with multiple rows? (with or without transactions) – Rick James Jul 03 '19 at 16:48
  • I am inserting into both InnoDB and MyRocks. Batch insert means: a single `INSERT`, each with multiple rows in one statement. As in this link: https://stackoverflow.com/questions/5526917/how-to-do-a-batch-insert-in-mysql – Trần Kim Dự Jul 03 '19 at 17:33
  • I don't know how to use sysbench for such. For InnoDB, I do recommend between 100 and 1000 rows per batch `INSERT` and have each such `INSERT` be a separate transaction. – Rick James Jul 03 '19 at 19:34
  • @RickJames thanks for the recommendation. In case we cannot test batch insert, can we implicitly assume a correlation in performance between InnoDB and RocksDB for batch insert while only testing point insert? (inserting a single row per transaction). – Trần Kim Dự Jul 03 '19 at 19:41
  • I have no knowledge of RocksDB. For InnoDB, a 100-row or 1000-row insert is about 10 times as fast (per row) as a 1-row insert. (It is non-linear.) – Rick James Jul 03 '19 at 19:44
  • How many rows per second do you expect to insert? How many terabytes of disk space will you ultimately fill? (Do the math!) InnoDB can do at least 1K rows/sec, which will fill a terabyte in a few months. Also, if you intend to purge "old" data after some number of days, we need to discuss that performance issue. – Rick James Jul 04 '19 at 21:57
  • The reason I need write-intensive performance is a very specific requirement: every day, I have a huge volume of data that needs to replace the table's contents. Let us say it is table A. I will insert into table B, then index the data on table B, then swap the table names between A and B. This operation needs to be quick, because I usually do it at night, and it happens at the end of the pipeline. – Trần Kim Dự Jul 05 '19 at 04:19
  • I have 250M records to insert every day, and it finishes after 2 hours (insert, then index the whole data set). In the future, the data will increase to 600M records. That is the reason why I need to choose a different database engine. So, for the math, my current situation is: inserting 250M records takes 2 hours, with a data size of 250GB. In the future, it will increase to 600M records. I want to find a database / database engine that can optimize this process so the running time stays around 2 hours. – Trần Kim Dự Jul 05 '19 at 04:23
  • May we see `SHOW CREATE TABLE`? I see a lot of less-than-optimal datatypes, etc. And there could be index tips. Can we see the queries that need the index(es)? Keep in mind that _any_ engine is limited by the disk speed, so shrinking the GBs helps. – Rick James Jul 05 '19 at 04:28
  • ```CREATE TABLE IF NOT EXISTS `tableA` ( `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT, `user_id` VARCHAR(63) NOT NULL DEFAULT '', `data` JSON NOT NULL DEFAULT '{}', PRIMARY KEY (`id`), UNIQUE INDEX `user_id_UNIQUE` (`user_id` ASC)) ENGINE = InnoDB;``` Here is my schema. I only need to find a single record by `user_id`. – Trần Kim Dự Jul 05 '19 at 07:03
  • **As a side hint**, I'd probably drop the `DEFAULT ''` on `user_id`; the unique index will only ever allow one row to use the default, and I can't see it doing anything other than causing confusion. – Arth Jul 09 '19 at 09:58

1 Answer

(From a MySQL point of view...)

  • Toss id and the PK -- saves 8 bytes per row.
  • Promote UNIQUE(user_id) to PRIMARY KEY(user_id) -- might save 40 bytes per row (depends on LENGTH(user_id)). (A revised schema is sketched below.)

Doing those will

  • Shrink the disk I/O needed (providing some speedup)
  • Eliminate one of the indexes (probably a significant part of the post-load processing)
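
A rough sketch of what the revised table might look like with both changes applied (this assumes the `tableA` definition from the question, and also drops the `DEFAULT ''` per the side hint in the comments):

```
-- Sketch: tableA with the surrogate id removed and user_id promoted to the PK.
CREATE TABLE IF NOT EXISTS `tableA` (
    `user_id` VARCHAR(63) NOT NULL,
    `data` JSON NOT NULL,
    PRIMARY KEY (`user_id`)
) ENGINE = InnoDB;
```

Only one index (the clustered PRIMARY KEY on user_id) is left to maintain during the load, and it is the one your lookup by user_id needs.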

Run OS monitoring tools to see what percentage of the I/O is being consumed. That is likely to be the limiting factor.

Benchmarking products are handy for limited situations. For your situation (and many others), it is best to build your product and time it.

Another thought...

What does the JSON look like? If the JSON has a simple structure (a consistent set of key:value pairs), then the disk footprint might be half as much (hence speed doubling) if you made individual columns. The processing to change from JSON to individual columns would be done in the client, which may (or may not) cancel out the savings I predict.

If the JSON is more complex, there still might be savings by pulling out "columns" that are always present.

If the JSON is "big", then compress it in the client, then write to a BLOB. This may shrink the disk footprint and network bandwidth by a factor of 3.

You mentioned 250GB for 250M rows? That's 1000 bytes/row. That means the JSON averages 700 bytes? (Note: there is overhead.) Compressing the JSON column into a BLOB would shrink to maybe 400 bytes/row total, hence only 100GB for 250M rows.
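
One possible shape for that, as a sketch (the table and column names here are made up; in practice the compression would be done in the client with zlib/gzip before the INSERT, but MySQL's COMPRESS()/UNCOMPRESS() illustrate the round trip):

```
-- Sketch: store client-compressed JSON in a blob column instead of a JSON column.
CREATE TABLE IF NOT EXISTS `tableA_compressed` (
    `user_id`  VARCHAR(63) NOT NULL,
    `data_zip` MEDIUMBLOB NOT NULL,   -- compressed JSON string
    PRIMARY KEY (`user_id`)
) ENGINE = InnoDB;

-- For illustration only; a real loader would compress in the client,
-- which also shrinks the network traffic.
INSERT INTO tableA_compressed (user_id, data_zip)
VALUES ('user-0001', COMPRESS('{"b": 100, "payload": "lots of repetitive JSON here"}'));

SELECT user_id, UNCOMPRESS(data_zip) AS data
FROM tableA_compressed
WHERE user_id = 'user-0001';
```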

{"b": 100} takes about 10 bytes. If b could be stored in a 2-byte SMALLINT column, that would shrink the record considerably.

Another thing: If you promote user_id to PK, then this is worth considering: Use a file sort to sort the table by user_id before loading it. This is probably faster than INSERTing the rows 'randomly'. (If the data is already sorted, then this extra sort would be wasted.)
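
A sketch of that load path (this assumes the daily data has already been exported to a tab-separated file and sorted by user_id outside MySQL, e.g. with the OS sort utility; the path and options are placeholders, and a stream of multi-row INSERTs in user_id order would work similarly):

```
-- Sketch: bulk-load a file that was pre-sorted by user_id (the PRIMARY KEY).
-- Each line: user_id <TAB> json (or compressed) payload.
LOAD DATA INFILE '/tmp/tableB_sorted.tsv'
INTO TABLE tableB
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(user_id, data);
```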

  • Thanks for your nice answer. I have tried some of these options before, and the performance was not as good as the original design. – Trần Kim Dự Jul 05 '19 at 17:22
  • - Remove id: field `user_id` now acts as the primary key, so every insertion has to index `user_id` (unlike the old design, which indexes only id, which is fast, and then manually re-indexes the whole table on column `user_id` afterwards). This solution is slower. - My JSON is pretty big, and I have optimized that JSON data to make it smaller. One optimization is flattening the content of any object which has only one field. For example: `{"a": {"b": 100}}` becomes `{"b": 100}`. The content is really smaller and the writing process is faster (as the results I showed). – Trần Kim Dự Jul 05 '19 at 17:22
  • About the move from the `JSON` to the `BLOB` datatype, I haven't tried it yet, and I will give it a try :D But does this improvement really fix the problem? Because I am moving from 250M records to 600M records, and I want to keep the writing process within 2 hours. (Apologies for the 3 comments; it was too long to fit in one.) – Trần Kim Dự Jul 05 '19 at 17:22
  • @TrầnKimDự - I addressed your Comments by adding to my Answer. – Rick James Jul 05 '19 at 18:11
  • About your last tip, I have tried that :D Actually I insert from a Hadoop system. Before the insert process, I sort by id (or by `user_id`, in the case where `user_id` is the primary key). The problem is: I use many threads for insertion, so the order is lost during the load. (It is still maintained within each batch, though.) – Trần Kim Dự Jul 05 '19 at 18:26
  • @TrầnKimDự - How big are the batches? Are the inserts being done with multi-row `INSERTs`? Or `LOAD DATA`? Seems like it could be practical to build a .csv file for the latter? (I'll get back to the 'sort' issue later.) – Rick James Jul 05 '19 at 18:29
  • - I read this official document: https://dev.mysql.com/doc/refman/5.7/en/json.html In this document, they state: "The space required to store a JSON document is roughly the same as for LONGBLOB or LONGTEXT." Is this point contradictory with the idea you mentioned before? – Trần Kim Dự Jul 06 '19 at 07:51
  • About the `LOAD DATA` method, that is an interesting idea. I will give it a try. – Trần Kim Dự Jul 06 '19 at 07:52
  • @TrầnKimDự - JSON size: That says that JSON, whether or not it is stored as text, is not well compressed. My tip on compressing the JSON _string_ still stands. – Rick James Jul 06 '19 at 13:54
  • @TrầnKimDự - LOAD DATA: Creating a csv file would add to the time. But, if you already have the data in a csv-like file, then it could be a big benefit. – Rick James Jul 06 '19 at 13:55