7

Josh's answer here gave me a good head start on how to insert a 256x64x250 value array into a MySQL database. But when I actually tried his INSERT statement on my data, it turned out to be horribly slow (as in 6 minutes for a 16 MB file).

ny, nx, nz = np.shape(data)
query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s)"""
for frames in range(nz):
    for rows in range(ny):
        for cols in range(nx):
            cursor.execute(query, (frames, rows, cols, data[rows,cols,frames]))

I was reading MySQL for Python, which explained that this wasn't the right approach because executing 4 million separate inserts is very inefficient.

Now my data consists of a lot of zeros (more than 90%, actually), so I threw in an if statement so that I only insert values greater than zero, and I used executemany() instead:

query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s ) """
values = []
for frames in range(nz):
    for rows in range(ny):
        for cols in range(nx):
            if data[rows,cols,frames] > 0.0:
                values.append((frames, rows, cols, data[rows,cols,frames]))           
cur.executemany(query, values)

This miraculously brought my processing time down to about 20 seconds, of which 14 seconds are spent creating the list of values (37k rows) and 4 seconds on the actual insert into the database.

So now I'm wondering: how can I speed this process up any further? I have a feeling my loop is horribly inefficient and there has to be a better way. If I need to insert 30 measurements per dog, this would still take 10 minutes, which seems far too long for this amount of data.

Here are two versions of my raw files: with headers or without headers. I'd love to try LOAD DATA INFILE, but I can't figure out how to parse the data correctly.

Ivo Flipse
  • Where does the data come from? – Alp Mar 27 '11 at 15:16
  • It's a large text file with some headers above each 'frame', what is it specifically you want to know? – Ivo Flipse Mar 27 '11 at 15:20
  • I'm just curious whether the slow part is reading the file or the nested for loops that create the query. – Alp Mar 27 '11 at 15:23
  • What is `data`? Maybe you can obtain `values` by applying some function on `data`? This could give you major speedup. – pajton Mar 27 '11 at 15:24
  • @alp & @pajton: data is a numpy array, which is loaded in memory, so my guess is that it comes from the nested loops. If there's another way to retrieve all the non-zero values with their indexes, then I'm all ears – Ivo Flipse Mar 27 '11 at 15:27
  • Keep in mind that the more [INSERT] operations are performed, the more likely IO performance becomes an issue. Databases write to disk - while there is a buffer, it can be exhausted. Also, you might want to look at [MySQL's LOAD DATA INFILE](http://dev.mysql.com/doc/refman/5.1/en/load-data.html) for loading from files, though I don't think it supports the decision logic you're using to throw out data (why is it being logged in the first place?). – OMG Ponies Mar 27 '11 at 16:21
  • @OMGPonies, the data is exported from another software, so I don't have any influence on the format. I can imagine that loading a file straight into the database would be even faster, but I'm not sure how to make sure that it has the right format. – Ivo Flipse Mar 27 '11 at 16:32
  • I've added an EDIT to my original answer which might be of interest. – Jon Black Mar 29 '11 at 06:54
  • @IvoFlipse `numpy.nonzero()` – Will Jul 30 '13 at 18:31

6 Answers

6

The fastest way to insert 4 million rows (16 MB of data) would be to use LOAD DATA INFILE - http://dev.mysql.com/doc/refman/5.0/en/load-data.html

So, if possible, generate a CSV file and then use LOAD DATA INFILE.
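
For instance, the CSV could be written straight from the array that is already in memory. This is only a rough sketch, assuming the `data`, `conn` and `cur` objects from the question; the file path is just an example, and LOAD DATA LOCAL INFILE has to be allowed by both the client and the server:

import csv
import numpy as np

# Dump only the non-zero entries, in the same column order as the `data` table.
rows, cols, frames = np.nonzero(data)
with open('/tmp/frames.csv', 'w') as f:
    writer = csv.writer(f, lineterminator='\n')
    for row, col, frame in zip(rows, cols, frames):
        writer.writerow([frame, row, col, data[row, col, frame]])

# Bulk-load the file; LOCAL makes the client send the file, otherwise the
# MySQL server itself must be able to read the path.
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/frames.csv'
    INTO TABLE `data`
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
    (frame, sensor_row, sensor_col, value)
""")
conn.commit()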

hope this helps :)

EDIT

So I took one of your original data files rolloff.dat and wrote a quick and dirty program to convert it to the following csv format.

Download frames.dat from here: http://rapidshare.com/files/454896698/frames.dat

Frames.dat

patient_name, sample_date dd/mm/yyyy, frame_time (ms), frame 0..248, row 0..255, col 0..62, value
"Krulle (opnieuw) Krupp",04/03/2010,0.00,0,5,39,0.4
"Krulle (opnieuw) Krupp",04/03/2010,0.00,0,5,40,0.4
...
"Krulle (opnieuw) Krupp",04/03/2010,0.00,0,10,42,0.4
"Krulle (opnieuw) Krupp",04/03/2010,0.00,0,10,43,0.4
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,4,40,0.4
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,5,39,0.4
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,5,40,0.7
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,6,44,0.7
"Krulle (opnieuw) Krupp",04/03/2010,7.94,1,6,45,0.4
...
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,10,0.4
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,11,0.4
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,12,1.1
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,13,1.4
"Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,14,0.4

The file contains a row only where a frame has a non-zero value at a given row and col - so zeros are excluded. 24,799 data rows were generated from your original file.

Next, I created a temporary loading (staging) table into which the frames.dat file is loaded. This staging table lets you manipulate/transform the data before loading it into the proper production/reporting tables.

drop table if exists sample_temp;
create table sample_temp
(
patient_name varchar(255) not null,
sample_date date,
frame_time decimal(6,2) not null default 0,
frame_id tinyint unsigned not null,
row_id tinyint unsigned not null,
col_id tinyint unsigned not null,
value decimal(4,1) not null default 0,
primary key (frame_id, row_id, col_id)
)
engine=innodb;

All that remains is to load the data (note: I am using Windows, so you'll have to edit this script to make it Linux compatible - check pathnames and change '\r\n' to '\n').

truncate table sample_temp;

start transaction;

load data infile 'c:\\import\\frames.dat' 
into table sample_temp
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\r\n'
ignore 1 lines
(
patient_name,
@sample_date,
frame_time,
frame_id,
row_id,
col_id,
value
)
set 
sample_date = str_to_date(@sample_date,'%d/%m/%Y');

commit;

Query OK, 24799 rows affected (1.87 sec)
Records: 24799  Deleted: 0  Skipped: 0  Warnings: 0

The 24K rows were loaded in 1.87 seconds.

Hope this helps :)

Jon Black
  • MySQL allows you to set what the delimiters are. – ceejayoz Mar 27 '11 at 16:37
  • @ceejayoz, how do I figure out what columns I need to assign to read the file correctly? Or would I have to parse the 'raw' file in Python to remove headers and such? – Ivo Flipse Mar 27 '11 at 17:44
  • I read that page, but for the life of me I can't figure out how to properly parse the data. Care to elaborate on it, because it would remove some 'redundant' steps – Ivo Flipse Mar 27 '11 at 21:36
  • I'd be happy to help, but can you post a little more info about what you do with the data once loaded, as it will determine the direction of my design? Post any table definitions you have, numbers of patients, frequency of scans/measurements, typical queries etc etc... – Jon Black Mar 28 '11 at 12:15
  • @f00, sorry it took so long, but [I decided to wrap the additional info up in a blog post.](http://superivo.wordpress.com/2011/03/28/database-designing-for-dummies/). If you need any specific information, let me know. – Ivo Flipse Mar 28 '11 at 19:25
  • Thanks for the interesting answer @f00, but wouldn't converting to .dat or any CSV file create as much overhead as your very fast solution saves? – Ivo Flipse Mar 29 '11 at 11:21
  • for the sample data files you supplied (rolloff) the conversion, written in C# in my case, took a few milliseconds. – Jon Black Mar 29 '11 at 15:49
  • Hmmm in that case I might still look into a batch processing tool for transforming these files @f00. Either way, thanks for the update I like how fast it is! – Ivo Flipse Mar 29 '11 at 16:19
  • No probs - but that's only the start of things, you'll have to consider how to transform the data in the sample_temp table to best suit your reporting/analysis requirements. Summary tables (materialised views), use of innodb clustered primary keys to optimise queries and many, many more considerations... Anyway, always happy to help out if you're in need of advice :) – Jon Black Mar 29 '11 at 16:40
5

If the data is a numpy array, you can try this:

query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s ) """
values = []
rows, cols, frames = numpy.nonzero(data)
for row, col, frame in zip(rows, cols, frames):
    values.append((frame, row, col, data[row,col,frame]))

cur.executemany(query, values)

or

query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s ) """
rows, cols, frames = numpy.nonzero(data)
values = [(frame, row, col, val) for row, col, frame, val in zip(rows, cols, frames, data[rows,cols,frames])]
cur.executemany(query, values)
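
For reference, here is a sketch of the fully vectorized variant worked out in the comments below, assuming the same `data`, `query` and `cur` objects (the `.tolist()` call hands the driver plain Python numbers instead of numpy scalars):

import numpy

# Build all (frame, row, col, value) rows at once, with no Python-level loop.
rows, cols, frames = numpy.nonzero(data)
values = numpy.column_stack((frames, rows, cols, data[rows, cols, frames]))
cur.executemany(query, values.tolist())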

Hope it helps

Hernan
  • I think there's something wrong with the list comprehension, but your first loop is blazingly fast: 0.34 seconds – Ivo Flipse Mar 27 '11 at 16:15
  • What error are you getting? Are you also running the line rows, cols, frames = numpy.nonzero(data)? I have edited the post to make it clearer. – Hernan Mar 27 '11 at 16:31
  • No that explained it ;-) It's even slightly faster! – Ivo Flipse Mar 27 '11 at 16:34
  • Just for curiosity, can you try (notice the ;, I cannot put new lines in the comments): ndx = numpy.nonzero(data); values = numpy.hstack((numpy.transpose(ndx), data[ndx].reshape((-1, 1)))); cur.executemany(query, values) or ndx = numpy.nonzero(data); values = numpy.hstack((numpy.transpose(ndx), data[ndx].reshape((-1, 1)))); cur.executemany(query, values.tolist()) – Hernan Mar 27 '11 at 16:51
  • It needs the values.tolist(), but this finishes the whole thing in 3.65 secs. Most of that time is spent on the executemany. Though of course it's a lot less readable – Ivo Flipse Mar 27 '11 at 17:12
1

I don't use Python or MySQL, but batch insert performance can often be sped up with transactions.
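
In Python that might look roughly like the sketch below; it assumes a MySQLdb-style connection object named conn and the query/values from the question, and the exact autocommit API depends on the driver:

# Run the whole bulk insert as one transaction: a single commit (and, on
# InnoDB, a single flush to disk) instead of one per statement.
conn.autocommit(False)   # MySQLdb method; other drivers use conn.autocommit = False
try:
    cur.executemany(query, values)
    conn.commit()
except Exception:
    conn.rollback()
    raise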

Tim
  • I'm struggling to think how adding transactions into the mix would improve performance... – Marc Gravell Mar 27 '11 at 15:21
  • Why do I deserve a downvote because you happen to be struggling to think? – Tim Mar 27 '11 at 17:39
  • @Tim - I'm pretty sure the downvote wasn't by Marc. (It wasn't me, either.) I could be wrong, but I'm guessing it was ceejayoz, who is the only participant in this thread who has registered any downvotes today. I hate anonymous downvotes; the voter should at least give the poster some feedback about what's wrong. – Ted Hopp Mar 27 '11 at 17:48
  • I didn't downvote you, but compared to the other answers, I don't really see how it helped me either. If only because I know so little about databases that I don't even know what you mean by "transactions" in this context – Ivo Flipse Mar 27 '11 at 17:53
  • @Tim rather than focusing on the downvote, it might be a better option to explain *how* a transaction (which usually *adds* a small amount of overhead for the process with the transaction, and potentially a few blocks as well) would improve performance here. – Marc Gravell Mar 28 '11 at 04:57
  • @Marc: OP had mentioned 30 data points (columns) eventually in his row. My assumption was that some of these columns would be indexed. A common "performance technique" for batch inserts is to temporarily disable indexing and re-enable the index afterwards. Second assumption: wrapping a batch insert in a transaction would postpone the index-writes until after the explicit commit, rather than having the index-write happening coincident with the next iteration's insert, as would be the case when individual rows are inserted in a tight loop with an implicit commit on each iteration. – Tim Mar 28 '11 at 13:49
  • @Tim - well, I'm not sure 30 data points is going to make much difference there, but good content. You should move that into the answer. +1 here (and the math should tell you the -1 wasn't me) – Marc Gravell Mar 28 '11 at 15:32
0

If I understand this correctly, executemany() executes an INSERT INTO query for each row you want to insert. This can be improved by creating a single INSERT query with all the values, which should look like this:

INSERT INTO data
  (frame, sensor_row, sensor_col, value)
VALUES
 (1, 1, 1, 1),
 (2, 2, 2, 2),
 (3, 3, 3, 3),
 ...

Your Python code should generate the row values in parentheses and build one query string out of them, so that you execute the query only once.
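
A rough sketch of that approach, assuming the values list built in the question (skipping proper escaping only works here because every field is numeric; MySQLdb's executemany() is reported to build a very similar multi-row statement under the hood for simple INSERTs):

# Build one INSERT ... VALUES (...),(...),... statement by hand.
rows_sql = ",".join(
    "(%d, %d, %d, %s)" % (frame, row, col, value)
    for frame, row, col, value in values
)
cur.execute(
    "INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES " + rows_sql
)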

Alp
  • That's what this line does @Alp: `values.append((frames, rows, cols, data[rows,cols,frames]))` The problem is, it's horribly slow to compose such a list and I have no idea how to optimize it – Ivo Flipse Mar 27 '11 at 15:13
  • Maybe I didn't get it, but seeing your code, I think the following queries would be executed: **INSERT INTO data (frame, sensor_row, sensor_col, value) VALUES (1, 1, 1, 1);**, **INSERT INTO data (frame, sensor_row, sensor_col, value) VALUES (2, 2, 2, 2);** and so on. Correct me if I am wrong. – Alp Mar 27 '11 at 15:18
  • print(query, values) returns `('INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s ) ', [(0, 31, 45, 0.40000001), (0, 31, 46, 0.40000001), (0, 32, 45, 0.40000001),` – Ivo Flipse Mar 27 '11 at 15:24
  • Ok then, consider my answer as useless. – Alp Mar 27 '11 at 15:26
0

Inserting multiple rows with each statement is one way of optimizing. However, why the need for the three loops? Some sort of data transformation might be useful instead.

Another option is to disable indexes during the insertion, if you are certain that you won't have any duplicate data (assuming you actually have indexes on the table). Indexes must be updated for each statement, and also checked to prevent duplicates.

Invoke ALTER TABLE tablename DISABLE KEYS before starting your inserts, invoke ALTER TABLE tablename ENABLE KEYS when finished, and see if it helps.

From the manual:

ALTER TABLE ... DISABLE KEYS tells MySQL to stop updating nonunique indexes. ALTER TABLE ... ENABLE KEYS then should be used to re-create missing indexes. MySQL does this with a special algorithm that is much faster than inserting keys one by one, so disabling keys before performing bulk insert operations should give a considerable speedup. Using ALTER TABLE ... DISABLE KEYS requires the INDEX privilege in addition to the privileges mentioned earlier.
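
In the code from the question, that would look roughly like the sketch below (assuming the query/values from the question; this only helps on MyISAM tables, since InnoDB doesn't support DISABLE KEYS):

# Skip non-unique index maintenance during the bulk load, rebuild it afterwards.
cur.execute("ALTER TABLE `data` DISABLE KEYS")
try:
    cur.executemany(query, values)
finally:
    cur.execute("ALTER TABLE `data` ENABLE KEYS")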

jishi
  • I guess I'm doing something wrong, because I tried `cur.execute("""ALTER TABLE data ENABLE KEYS""")` and got **Table storage engine for 'data' doesn't have this option**. – Ivo Flipse Mar 27 '11 at 18:05
  • Hm, maybe it only works for MyISAM tables. What are you using? – jishi Mar 27 '11 at 18:19
  • I had it on InnoDB (default setting in HeidiSQL) but changing it to MyISAM gives the same error. I guess my table design doesn't match the data format – Ivo Flipse Mar 27 '11 at 18:24
  • Yeah, it doesn't work with innoDB but should be fine with MyISAM, that's odd. – jishi Mar 27 '11 at 18:48
  • However, InnoDB is pretty slow for non-concurrent operations; having multiple threads/connections inserting data would probably be faster in that case, but if MyISAM is an option you might want to try it. DISABLE KEYS should work with MyISAM. – jishi Mar 27 '11 at 18:50
  • @Jishi, I ran the script a couple of times with your lines in it and hadn't noticed that it didn't give the error message when I had it set to MyISAM, so yes, it was working. However, that way it took 6.4 seconds, whereas on InnoDB, which doesn't support it, it only takes 5.8 seconds – Ivo Flipse Mar 27 '11 at 19:13
-1

You could employ list comprehensions instead of for loops:

values = [(frames, rows, cols, data[rows, cols, frames])
          for frames in range(nz) for rows in range(ny)
          for cols in range(nx) if data[rows, cols, frames] > 0.0]

I'd estimate this could give you a slight speedup, on the order of 10-20%.

pajton