
Update: After looking at my Python script, I found that preparing and calculating the data before inserting the records is where most of the processing time goes... not the DB insert. The inserts themselves are relatively quick (about 30 seconds for 10,000 records with a table size of 1,000,000 records). I do think the feedback I got on this question will help me optimize the inserts as well. Thanks to all who replied. Now I'm off to try to optimize my Python script to speed it up.

I currently have a MariaDB table with the following structure:

CREATE TABLE IF NOT EXISTS `adeck_errors` (
  `StormID` varchar(8) NOT NULL DEFAULT '1',
  `ModelBaseTime` datetime NOT NULL,
  `Model` varchar(4) NOT NULL,
  `Tau` smallint(4) NOT NULL,
  `LatCARQ` float DEFAULT NULL,
  `LonCARQ` float DEFAULT NULL,
  `LatModel` float DEFAULT NULL,
  `LonModel` float DEFAULT NULL,
  `DistError` smallint(6) DEFAULT NULL,
  `WindCARQ` int(11) DEFAULT NULL,
  `WindModel` int(11) DEFAULT NULL,
  `WindError` smallint(6) DEFAULT NULL,
  `PresCARQ` int(11) DEFAULT NULL,
  `PresModel` int(11) DEFAULT NULL,
  `PresError` smallint(6) DEFAULT NULL,
  UNIQUE KEY `StormID` (`StormID`,`ModelBaseTime`,`Model`,`Tau`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

As you can see, my unique key spans 4 fields in the table to ensure I do not get any duplicate entries. Inserting 10,000+ records into a table that already contains 1 million records takes about 15-20 minutes, and I would like to speed this process up.

My question is: would it be better to create a hash of the data in the 4 fields and use that hash as the primary key so I don't get duplicates? When I'm processing the data before inserting (with Python), I can quickly create a hash value of the 4 pieces of data before I move into my DB insert function.
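For illustration, this is roughly the kind of table I have in mind. It is only a sketch: the hash would be an MD5 of the 4 key fields computed in Python before the insert, and the table and column names here are just placeholders.

-- Hypothetical sketch only: same table, but a hash of the 4 key fields is the primary key.
-- The 16-byte value would be md5(StormID + ModelBaseTime + Model + Tau), computed in Python.
CREATE TABLE IF NOT EXISTS `adeck_errors_hashed` (
  `KeyHash` binary(16) NOT NULL,
  `StormID` varchar(8) NOT NULL DEFAULT '1',
  `ModelBaseTime` datetime NOT NULL,
  `Model` varchar(4) NOT NULL,
  `Tau` smallint(4) NOT NULL,
  -- ... remaining data columns unchanged ...
  PRIMARY KEY (`KeyHash`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;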

Thanks for your suggestions. Bryan

  • Your hash idea might work if you could establish that you would never have collisions. – Tim Biegeleisen Sep 11 '18 at 01:00
  • You might like my presentation, [Load Data Fast!](https://www.slideshare.net/billkarwin/load-data-fast) – Bill Karwin Sep 11 '18 at 01:09
  • I don't know how you're inserting data (you only say "with Python"), but it shouldn't be that slow. I just tested and it takes me 14.22 seconds to insert over 1 million rows into the table you show, running on my Macbook. Maybe you're using some terribly inefficient method (e.g. ActiveRecord ORM) that does a lot of overhead work for each row? Unless we know why it's so slow for you, one cannot say if using a hash would make any difference. There's no reason to think it would. – Bill Karwin Sep 11 '18 at 01:33
  • Can you edit the question to include your python code? – Sam M Sep 11 '18 at 02:07
  • With the size of the columns in question (they are quite small, especially on latin1) it seems a bit of overkill to make a hash. You may as well make the `UNIQUE KEY` the `PRIMARY KEY`. – danblack Sep 11 '18 at 03:49

2 Answers


Currently you don't have a "proper" or "normal" primary key. This matters because the primary key is used as the so-called clustered index, which determines how the rows are ordered on disk. Without one, InnoDB picks the next suitable key as the clustered index:

  • If you do not define a PRIMARY KEY for your table, MySQL locates the first UNIQUE index where all the key columns are NOT NULL and InnoDB uses it as the clustered index.

In your case that is the only existing UNIQUE KEY, `StormID`. Your rows are ordered on disk by this key, which gives you the same problem as using a GUID/UUID as the primary key. Questions and articles like "The differences between int and uuid in mysql", http://kccoder.com/mysql/uuid-vs-int-insert-performance/ or https://www.percona.com/blog/2007/03/13/to-uuid-or-not-to-uuid/ explain why you shouldn't use such a key to order the rows on disk.

Create a normal BIGINT AUTO_INCREMENT PRIMARY KEY column to get better performance.
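A minimal sketch of that change, assuming the existing UNIQUE key is kept so duplicates are still rejected (the column name `id` is only an example):

-- Add a surrogate auto-increment column and make it the clustered primary key.
ALTER TABLE `adeck_errors`
  ADD COLUMN `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;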

Progman
  • Change the UNIQUE key to be PRIMARY. This will probably have no impact, however, it will make your intent clearer.
  • The dedupping should work fine.
  • For better performance, sort the incoming data according to the PK before starting the load.
  • How are you loading? If you receive the data in a csv file, LOAD DATA is best, multi-row INSERTs are second best; one-row-at-a-time is slowest. (See the sketch after this list.)
  • A UUID would be really bad for performance; don't do it.
  • innodb_buffer_pool_size should be about 70% of available RAM.
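Rough sketches of the two faster loading styles. The file path, sample values, and delimiters are placeholders; IGNORE tells MariaDB to silently skip rows that collide with the unique/primary key, which preserves the dedupping.

-- Multi-row INSERT: many rows per statement instead of one round trip per row.
INSERT IGNORE INTO `adeck_errors`
  (`StormID`, `ModelBaseTime`, `Model`, `Tau`, `LatCARQ`, `LonCARQ`)
VALUES
  ('AL092018', '2018-09-10 00:00:00', 'AVNO', 0,  31.1, -74.9),
  ('AL092018', '2018-09-10 00:00:00', 'AVNO', 6,  31.8, -75.6),
  ('AL092018', '2018-09-10 00:00:00', 'AVNO', 12, 32.4, -76.3);

-- LOAD DATA from a CSV file (fastest; requires the local_infile/FILE setting to be enabled).
LOAD DATA LOCAL INFILE '/path/to/batch.csv'
IGNORE INTO TABLE `adeck_errors`
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(`StormID`, `ModelBaseTime`, `Model`, `Tau`, `LatCARQ`, `LonCARQ`, `LatModel`, `LonModel`,
 `DistError`, `WindCARQ`, `WindModel`, `WindError`, `PresCARQ`, `PresModel`, `PresError`);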
Rick James