
I am having a performance issue when inserting some data into a MySQL table. The table has a bunch of columns, let's say DATE, A, B, C, D, E, F, where (DATE, A, B, C, D, E) is the primary key. Every day I insert 70k rows into that table (with a different date), and the table now contains 18 million rows. The method I use to insert the rows is just sending 70k individual INSERT queries.

The problem I am having is that the queries have started to take a lot more time than they used to, going from a few minutes to a few hours. I profiled the inserts and this is the chart I got:

[Chart: speed of each insert (in sec) vs. the insert number for that day]

A few strange facts:

  1. Most queries take less than 2 ms to execute.
  2. The duration of the slow queries increases linearly with the number of rows already inserted in the table for that date.
  3. This behavior only happens at night, after a bunch of processes have run on the database. Inserting during the day is fast, and so is inserting on weekends.
  4. The overall speed doesn't depend on what else is running on the database; in fact, nothing else is running on the database when this happens.
  5. There is nothing in a query itself that explains whether it is fast or slow: the fast ones are very similar to the slow ones, and from one day to the next they are not the same set.
  6. The behavior does not change from one day to the next.

Any idea what could cause this?

**Edit:** the columns in the index are in the following order:

DATE NOT NULL,
DATE NOT NULL,
VARCHAR (10) NOT NULL,
VARCHAR (45) NOT NULL,
VARCHAR (45) NOT NULL,
VARCHAR (3) NOT NULL,
VARCHAR (45) NOT NULL,
DOUBLE NOT NULL,
VARCHAR (10) NOT NULL,
VARCHAR (45) NOT NULL,
VARCHAR (45) NOT NULL,
VARCHAR (45) NOT NULL,

The dates are either the same as today or left empty, and the double is always the same number (no clue who designed this table).
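
For reference, here is a minimal sketch of what such a table and its daily inserts might look like, assuming the seven logical columns DATE, A, B, C, D, E, F from the question, a MyISAM engine (as discussed in the comments), and hypothetical table and column names:

    CREATE TABLE daily_data (
        dt DATE        NOT NULL,   -- the "DATE" column; identical for a whole day's 70k-row batch
        a  VARCHAR(10) NOT NULL,
        b  VARCHAR(45) NOT NULL,
        c  VARCHAR(45) NOT NULL,
        d  VARCHAR(3)  NOT NULL,
        e  VARCHAR(45) NOT NULL,
        f  DOUBLE      NOT NULL,   -- non-key payload column
        PRIMARY KEY (dt, a, b, c, d, e)
    ) ENGINE=MyISAM;

    -- each of the ~70k statements sent per day is roughly of this form:
    INSERT INTO daily_data (dt, a, b, c, d, e, f)
    VALUES ('2013-09-17', 'ABC', 'some-code', 'another-code', 'XYZ', 'yet-another', 42.0);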

edeboursetty
  • There's probably a correlation between number of records inserted and number already in the database. Larger indexes take more time to update. – tadman Sep 17 '13 at 19:02
  • @tadman: how would you explain that it only happens during certain times of the day? – edeboursetty Sep 17 '13 at 19:04
  • What indices do you have on the table? Do any of them include a date (without time)? If so, that could explain the discrepancy over time, as the b-tree rebalancing will increase over the course of the day... – PinnyM Sep 17 '13 at 19:04
  • I don't know what the performance issue is. However, I might suggest that you load the 70,000 new inserts into a temporary table and then use that for inserting (see the sketch after this comment thread). – Gordon Linoff Sep 17 '13 at 19:54
  • @PinnyM: there is only one index, that of the primary key. Yes there is a date, but it's always the same date for a set of 70k queries and is the leftmost item in the index – edeboursetty Sep 17 '13 at 19:57
  • Does the date increment every day? – PinnyM Sep 17 '13 at 20:00
  • @PinnyM: yes, it does – edeboursetty Sep 17 '13 at 20:00
  • Getting closer... and what types are the other fields in the index? In the sequence that the index is defined, please. – PinnyM Sep 17 '13 at 20:01
  • @PinnyM: added that as an edit to the original post – edeboursetty Sep 17 '13 at 20:06
  • Shouldn't those 70k rows easily fit into memory? I'd expect very little physical IO here. Certainly not on the order of .4sec for a single worst-case insert. Be there tree rebalancing or not. Also, maybe you can switch to InnoDB and insert all rows in one transaction. – usr Sep 17 '13 at 21:22
  • @usr, just because the table fits in memory doesn't mean there isn't i/o for writes... – PinnyM Sep 18 '13 at 03:31
  • @PinnyM lazy writing would convert the IO into sequential IO. 70k rows sequentially written should be *nothing*. 1sec or so in total. – usr Sep 18 '13 at 10:17
  • @usr, I was under the impression that lazy writing for MyISAM would only occur for INSERT/UPDATE statements declared as DELAYED, is that incorrect? In any event, rebalancing a tree involving 18M rows is not necessarily a trivial task... – PinnyM Sep 18 '13 at 12:44
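
As a rough illustration of the temporary-table suggestion in the comments above (the staging-table name is hypothetical and it reuses the daily_data sketch from the question; the ORDER BY is an extra assumption meant to feed rows to the index in key order):

    -- stage the day's 70k rows first, then copy them over in one statement
    CREATE TEMPORARY TABLE staging LIKE daily_data;

    -- ... load the day's rows into `staging` (LOAD DATA INFILE, batched INSERTs, etc.) ...

    INSERT INTO daily_data (dt, a, b, c, d, e, f)
    SELECT dt, a, b, c, d, e, f
    FROM staging
    ORDER BY dt, a, b, c, d, e;   -- insert in primary-key order so the index grows mostly by appending

    DROP TEMPORARY TABLE staging;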

1 Answer


The short explanation is that you have an index that is non-incremental within the scope of a single day. Non-incremental indices are generally slower to insert into and update, because they require rebalancing the index tree more often, and to a greater extent, than an incremental index does.

To explain this further - assume the following schema:

a (int) | b (varchar)

And the index is (a, b)
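
A concrete, minimal version of this toy schema might look like the following (the table name and the MyISAM engine are assumptions for illustration):

    CREATE TABLE t (
        a INT         NOT NULL,
        b VARCHAR(10) NOT NULL,
        PRIMARY KEY (a, b)
    ) ENGINE=MyISAM;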

Now we insert:

1, 'foo'
2, 'bar'
3, 'baz'

This will be quite fast because the index will simply append on each insert. Now let's try the following:

100, 'foo'
100, 'bar'
100, 'baz'

This won't be quite as fast, since 'bar' needs to be inserted before 'foo', and 'baz' needs to be inserted between the other two. This often requires the index to rewrite part of the tree to accommodate the new entry, and this 'rebalancing' act takes some time. The larger the components involved in the rebalancing (in this case, the subset where a = 100), the more time it will take. Note that this rebalancing doesn't necessarily happen on each insert, just more often and more extensively, because the tree will usually leave some room within the leaves for expansion. When a leaf runs out of room, it knows that it's time to rebalance.

In your case, since your index is primarily based on the current date, you are constantly rebalancing your tree within the scope of the single day. Each day starts a new scope, and rebalancing starts over within that day's scope. Initially this involves just a bit of rebalancing, but the work grows as the set of existing entries for the day increases. The cycle starts over when a new day begins, which is the pattern you are seeing.

That this is happening to the primary key may make matters even worse, since instead of shifting some index pointers around, entire rows of data may need to be shifted to accommodate the new entry. (This last point assumes that MyISAM clustering is performed on the primary key, a point that I haven't gotten clarification on to this day, although anecdotal evidence does seem to support this. For example, see here and here.)

PinnyM
  • This is a good general explanation of why indexing can slow down on larger tables. One thing you might suggest is looking into the use of table partitioning to break that table down into more logical parts, like month-based partitions (a rough sketch follows below). Also – Mike Brant Sep 17 '13 at 20:48
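
As a rough illustration of that partitioning idea (reusing the hypothetical daily_data sketch from the question; the partition boundaries are arbitrary, and MySQL requires the partitioning column to appear in every unique key, which the date column here already does):

    ALTER TABLE daily_data
        PARTITION BY RANGE (TO_DAYS(dt)) (
            PARTITION p2013_08 VALUES LESS THAN (TO_DAYS('2013-09-01')),
            PARTITION p2013_09 VALUES LESS THAN (TO_DAYS('2013-10-01')),
            PARTITION p_future VALUES LESS THAN MAXVALUE
        );

New partitions can then be added periodically (for example with ALTER TABLE ... REORGANIZE PARTITION), so the index touched by each day's inserts covers only the current month rather than all 18 million rows.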