I am seeing a very long execution time (as opposed to a long fetch time) when using Python MySQLdb to fetch data from a big table, and I would like to understand whether anything is obviously wrong.

My table is defined as follows:

create table mytable(
  a varchar(3),
  b bigint,
  c int,
  d int,
  e datetime,
  f varchar(20),
  g varchar(10),
  primary key(a, b, c, d))
ENGINE=InnoDB;

It currently contains 150 million rows, and the table size estimate is 19GB.

The Python code is as follows:

import MySQLdb
database = MySQLdb.connect(passwd="x", host="dbserver", user="user", db="database", port=9999)
mysql_query = """select a, b, c, d, e, f, g from mytable use index (primary) where a = %s order by a, b, c, d"""
mysql_cursor = database.cursor()
mysql_cursor.execute(mysql_query, ["AA"])
for a, b, c, d, e, f, g in mysql_cursor:
    pass  # Do something with the row

My surprise comes from the time spent on the execute command. It literally spends ages there, although I would have expected the execute to take almost no time (as it should traverse the table using the primary key) and the for loop to be where most of the time is spent.

The explain plan is as follows:

explain select a, b, c, d, e, f, g from mytable use index (primary) where a = %s order by a, b, c, d
id: 1, select_type: SIMPLE, table: eventindex, type: ref, possible_keys: PRIMARY, key: PRIMARY, key_len: 5, ref: const, rows: 87402369, Extra: Using where

At the moment, all rows contain the same value in column a (I had in mind to add other values later, but for now the distribution of column a is not really balanced). Column b is better distributed.

What could explain that MySQL is spending so much time in executing the query (as opposed to spending time fetching the rows)?

Bonus question. Any obvious quick win to optimize this use case? Partitioning the table on column b? Column a? Removing column a, and using separate tables instead?

Gerard Yin

2 Answers

Looks mostly like a MySQL question actually - I don't think the problem has to do with either Python or mysql-python.

wrt/ SQL stuff: an index that's not selective enough (too many rows sharing the same value) can be quite detrimental, since you end up doing sequential scans in addition to the index tree traversal - and actually much more disk access than with a plain table scan - so you lose on both sides (IOW: you get the overhead of the index tree traversal but none of its benefits). You'll find more on this here: MySQL: low cardinality/selectivity columns = how to index? and here: Role of selectivity in index scan/seek
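A quick way to check this on the table in question (a diagnostic sketch, not from the original post) is to compare the number of distinct values in column a to the total row count:

```sql
-- If this ratio is close to 0, the index prefix on column a is not
-- selective: an index range scan on a alone will touch most of the table.
select count(distinct a) / count(*) as selectivity,
       count(distinct a)            as distinct_a,
       count(*)                     as total_rows
from mytable;
```

Given the question states all 150 million rows currently share one value in column a, this ratio would be vanishingly small, which is exactly the low-cardinality situation described above.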

In your case you may want to try the query without the use index clause, and probably even force the optimizer to plain bypass the index using an ignore index clause instead.
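Concretely, that suggestion amounts to these two variants of the original query (assuming the same table and literal value as in the question):

```sql
-- Let the optimizer choose freely (drop the hint):
select a, b, c, d, e, f, g from mytable
where a = 'AA' order by a, b, c, d;

-- Or explicitly forbid the primary key, forcing a table scan:
select a, b, c, d, e, f, g from mytable ignore index (primary)
where a = 'AA' order by a, b, c, d;
```

Comparing the timings of the two against the original `use index (primary)` version would show whether the index traversal is the bottleneck.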

bruno desthuilliers
After further investigation, it appears this is normal behaviour for MySQL. From various sources, it looks like most of the work for a select is done in the execute phase, and during the fetch only network transfer takes place. I had spent so much time with Oracle (where execute usually does almost nothing in practice and the meat of the processing takes place at fetch time) that I didn't realize MySQL could behave differently.

Depending on the context, a work-around that lets you iterate through items without the up-front latency is to implement a paging system, by encapsulating smaller fetches in a Python generator. On the other hand, we lose consistency of data across calls, but this is acceptable in my case. Here is a basis for those interested in this approach. The adaptation needed to fetch the next page makes the SQL query more complex and less maintainable, and binds your code more tightly to your primary key structure than you might like, so you should weigh the pros and cons before going for this. The good news is that this complexity can be hidden behind the generator.

import MySQLdb

database = MySQLdb.connect(passwd="x", host="dbserver", user="user", db="database", port=9999)

def get_next_item(database):  # Generator encapsulating the paging system
    first_call = True
    mysql_cursor = database.cursor()
    nothing_more_found = False
    while not nothing_more_found:
        mysql_query = """select a, b, c, d, e, f, g from mytable use index (primary)
            where a = %s order by a, b, c, d
            limit 10000""" if first_call else """select a, b, c, d, e, f, g from mytable use index (primary)
            where a = %s and ((b > %s) or (b = %s and c > %s) or (b = %s and c = %s and d > %s))
            order by a, b, c, d
            limit 10000"""

        if first_call:
            mysql_cursor.execute(mysql_query, ["AA"])
            first_call = False
        else:
            # Resume after the last row seen on the previous page
            mysql_cursor.execute(mysql_query, ["AA", last_b, last_b, last_c, last_b, last_c, last_d])
        if mysql_cursor.rowcount == 0:
            nothing_more_found = True
        for a, b, c, d, e, f, g in mysql_cursor:
            yield (a, b, c, d, e, f, g)
            last_b, last_c, last_d = b, c, d

for a, b, c, d, e, f, g in get_next_item(database):  # Usage of the generator
    pass  # Do something with the row
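The long OR chain in the second query is just a row-wise "comes after (b, c, d)" test, which Python expresses directly as tuple comparison. This small standalone check (illustration only, not part of the original code) confirms the two forms are equivalent:

```python
import itertools

# Keyset-pagination predicate expanded the way the SQL WHERE clause expands it:
# (b > last_b) or (b = last_b and c > last_c) or (b = last_b and c = last_c and d > last_d)
def after_expanded(b, c, d, last_b, last_c, last_d):
    return (b > last_b) \
        or (b == last_b and c > last_c) \
        or (b == last_b and c == last_c and d > last_d)

# Python tuple comparison is lexicographic, so this is the same predicate.
def after_tuple(b, c, d, last_b, last_c, last_d):
    return (b, c, d) > (last_b, last_c, last_d)

# Exhaustively compare both forms over a small domain.
rows = list(itertools.product(range(3), repeat=3))
assert all(
    after_expanded(*row, *last) == after_tuple(*row, *last)
    for row in rows for last in rows
)
```

MySQL also accepts the row-constructor form directly - `where a = %s and (b, c, d) > (%s, %s, %s)` - which is easier to read, though depending on the server version it may not be optimized as well as the expanded OR form, so measure before switching.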

Explanation on MySQL execute vs. fetch in this post from Mike Lischke.

The fetch time purely measures how long it took to transfer the result, which has absolutely nothing to do with executing the query. The fetch time can even vary each time you run the query. Why should your network connection decide how good or bad your query is? Ok, one use actually exists: if the query returns too much data the transfer takes much longer. But even this is not entirely true, because sometimes results are cached, so they can be sent out quicker.

On the other hand, for Oracle, during a select, most of the action takes place during the fetch. This is explained by Tom Kyte himself here:

think of it this way

1) parse - pretty well defined, that is prepareStatement - we do a soft or hard parse, compile the statement, figure out how to execute it.

2) execute - we OPEN the statement. For an update, for a delete, for an insert - that would be it, when you OPEN the statement, we execute it. All of the work happens here.

for select it is more complex. Most selects will do ZERO work during the execute. All we are doing is opening the cursor - the cursor is a pointer to the space in the shared pool where the plan is, your bind variable values, the SCN that represents the "as of" time for your query - in short the cursor at this point is your context, your virtual machine state, think of the SQL plan as if it were bytecode (it is) executed as a program (it is) in a virtual machine (it is). The cursor is your instruction pointer (where are you in the execution of this statement), your state (like registers), etc. Normally, a select does nothing here - it just "gets ready to rock and roll, the program is ready to go, but not yet really started".

However, there are exceptions to everything - turn on trace and do a select * from scott.emp FOR UPDATE. That is a select, but it is also an update. You would see work done during the execute as well as the fetch phase. The work done during the execute was that of going out and touching every row and locking it. The work done during the fetch phase was that of going out and retrieving the data back to the client.

3) fetch - this is where we see almost all of the work for SELECTS (and nothing really for the other DMLS as you do not fetch from an update).

There are two ways a SELECT might be processed. What I call a "quick return query" and a "slow return query"

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:275215756923#39255764276301

is an excerpt from Effective Oracle by Design describing this in depth, but suffice to say a query of the form:

select * from one_billion_row_table;

would not copy the data anywhere, would not need to access the last row before returning the first row. We would just read the data as you fetch it from the blocks it resides on.

However, a query of the form:

select * from one_billion_row_table order by unindexed_column;

we would probably have to read the last row before returning the first row (since the last row read could well be the first row returned!) and we'd need to copy that data somewhere (temp, sort area space) first.

In the case of the first query, if you:

parsed it (little work parsing), opened it (no real work, just getting ready), fetched 1 row, and closed it

you would see VERY little work performed in the fetch phase, we'd just have to read one block probably to return the first record.

However, do the same steps against the second query and you would see the fetch of a single row do a TON of work - since we have to find the last row before the first can be returned.

Gerard Yin