9

I have 5 million rows in a MySQL DB sitting across the (local) network (so a quick connection, not over the internet).

The connection to the DB works fine, but if I try to do

f = pd.read_sql_query('SELECT * FROM mytable', engine, index_col='ID')

this takes a really long time. Even chunking with chunksize is slow. Besides, I don't really know whether it has just hung there or is actually retrieving information.

I'd like to ask those of you who work with large data in a DB: how do you retrieve your data for your Pandas session?

Would it be "smarter", for example, to run the query, dump the results to a csv file, and load that into Pandas? It sounds much more involved than it needs to be.

Dervin Thunk
  • Don't retrieve 5 million records, especially for a wide table; the I/O will kill you. – dbugger Jul 29 '15 at 14:03
  • @dbugger: Yes, sorry, the query up there is just an example; I don't actually `SELECT * FROM table`. But still, if I chunk, creating the iterator for even a relatively small subset of records (>10%) never finishes... So I think I'm doing something wrong with the workflow. Thanks. – Dervin Thunk Jul 29 '15 at 14:06
  • Just to be sure that there is not something else wrong (apart from the size): if you add a `LIMIT 100` (or bigger) to the query, does it work as expected? – joris Jul 29 '15 at 14:56
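For reference, the sanity check suggested in the last comment would be something along these lines (`engine` being the same SQLAlchemy engine used in the question):

# Cap the query with LIMIT to confirm the connection and the query itself
# behave before attempting the full 5 million rows.
f = pd.read_sql_query('SELECT * FROM mytable LIMIT 100', engine, index_col='ID')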

3 Answers

6

The best way of loading all the data from a table out of *any* SQL database into pandas is:

  1. Dumping the data out of the database, using `COPY` for PostgreSQL, `SELECT ... INTO OUTFILE` for MySQL, or the equivalent for other dialects (a sketch of the MySQL case follows this list).
  2. Reading the resulting csv file with the `pandas.read_csv` function.
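A rough sketch of those two steps for the MySQL case in the question (the DSN, paths and column names below are placeholders, not from the original post). Note that `SELECT ... INTO OUTFILE` writes the file on the database server, so the path has to be reachable from the machine running pandas (e.g. a shared mount), and the MySQL user needs the `FILE` privilege:

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://user:password@dbhost/mydb')  # placeholder DSN

# Step 1: have MySQL dump the table to a csv file (written on the DB server).
with engine.connect() as conn:
    conn.execute(text(
        "SELECT * FROM mytable "
        "INTO OUTFILE '/shared/mytable_dump.csv' "
        "FIELDS TERMINATED BY ',' "
        "LINES TERMINATED BY '\\n'"
    ))

# Step 2: read the dump with pandas. INTO OUTFILE writes no header row,
# so the column names have to be supplied explicitly.
df = pd.read_csv('/shared/mytable_dump.csv',
                 names=['ID', 'col1', 'col2'],   # placeholder column names
                 index_col='ID')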

Use the connector only for reading a few rows. The power of an SQL database is its ability to deliver small chunks of data based on indices.

Delivering entire tables is something you do with dumps.

firelynx
  • If you don't want to do this manually, have a look at `odo`: http://odo.readthedocs.org/en/latest/sql.html#conversions – joris Jul 29 '15 at 14:57
2

I had a similar issue whilst working with an Oracle db (for me it turned out it was taking a long time to retrieve all the data, during which time I had no idea how far along it was, or whether anything had gone wrong) - my solution was to stream the results of my query into a set of csv files, and then load them into Pandas.

I'm sure there are faster ways of doing this, but this worked surprisingly well for datasets of around 8 million rows.

You can see the code I used on my GitHub page for easy_query.py, but the core function looked like this:

import cx_Oracle as ora
import pandas as pd

def SQLCurtoCSV(sqlstring, connstring, filename, chunksize):
    # Stream the query results into numbered csv files of `chunksize` rows each.
    # `filename` should contain '%%', which gets replaced with the chunk number.
    connection = ora.connect(connstring)
    cursor = connection.cursor()
    cursor.arraysize = 256            # rows fetched per round trip to the DB
    cursor.execute(sqlstring)
    columns = [rec[0] for rec in cursor.description]
    r = []                            # rows buffered for the current chunk
    i = 0                             # number of chunk files written so far
    for row in cursor:
        r.append(row)
        if len(r) >= chunksize:
            i += 1
            df = pd.DataFrame.from_records(r, columns=columns)
            df.to_csv(filename.replace('%%', str(i)), sep='|')
            r = []
    if r:                             # flush the final, partial chunk
        i += 1
        df = pd.DataFrame.from_records(r, columns=columns)
        df.to_csv(filename.replace('%%', str(i)), sep='|')
    cursor.close()
    connection.close()

The surrounding module imports cx_Oracle to provide the various database hooks/API calls, but I'd expect similar functions to be available through whichever MySQL API you use.
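For illustration, a call to the function above might look something like this (the query, connection string and filename pattern are made up):

# '%%' in the filename is replaced with the chunk number, so this writes
# extract_1.csv, extract_2.csv, ... of 100,000 rows each.
SQLCurtoCSV('SELECT * FROM mytable',
            'user/password@dbhost:1521/orcl',
            'extract_%%.csv',
            chunksize=100000)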

What's nice is that you can see the files building up in your chosen directory, so you get some kind of feedback as to whether your extract is working, and how many results per second/minute/hour you can expect to receive.

It also means you can work on the initial files whilst the rest are being fetched.

Once all the data has been saved down to individual files, they can be loaded up into a single Pandas dataframe using multiple `pandas.read_csv` and `pandas.concat` statements.
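For example, a minimal sketch of that last step, assuming the chunk files were written with the '|' separator and the 'extract_%%.csv' pattern used above:

import glob
import pandas as pd

# Gather the chunk files and stitch them together into one DataFrame.
# (Lexicographic sorting puts extract_10.csv before extract_2.csv, which
# only matters if row order is important.)
files = sorted(glob.glob('extract_*.csv'))
df = pd.concat((pd.read_csv(f, sep='|', index_col=0) for f in files),
               ignore_index=True)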

Thomas Kimber
  • Did you compare this to using `pd.read_sql_query`? As this function does basically the same thing (calling `execute` and then `fetchmany` if using chunksize), I would think `read_sql_query` is easier to use and even faster. – joris Jul 29 '15 at 14:55
  • I have to say no, I didn't - but depending on how it manages the chunksize parameter, that looks like it could be a solution. Q: How does it handle the chunksize parameter? e.g. Say my query will return 5 million rows and chunksize is 100,000, what would be the result? I never really understood how chunksize worked. – Thomas Kimber Jul 29 '15 at 15:00
  • [later edit] Just found [this](http://stackoverflow.com/questions/15555005/get-inferred-dataframe-types-iteratively-using-chunksize), which suggests there's a kind of chunking object that allows stepped views over a master dataset - so yes, could be just the thing. – Thomas Kimber Jul 29 '15 at 15:07
  • The full code above would roughly be something like: `for chunk in pd.read_sql_query(... chunksize=..): chunk.to_csv(..)`. But the question is then whether it is still necessary to write it out in chunks to csv. – joris Jul 29 '15 at 15:35
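Spelled out (with a made-up chunk size and filename pattern, and with `sqlstring` and `connection` standing in for the query and DB connection from the answer above), that one-liner would be roughly:

import pandas as pd

# Let pandas do the cursor work: stream the query in chunks and write each
# chunk to its own csv file, much like the hand-rolled function above.
for i, chunk in enumerate(pd.read_sql_query(sqlstring, connection,
                                            chunksize=100000), start=1):
    chunk.to_csv('extract_{}.csv'.format(i), sep='|')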
0

query: your SQL query.
conn: a connection to your database.
chunksize: the number of rows to fetch per batch; with chunksize set, read_sql_query returns a generator of DataFrames.

Try the code below to extract the data in chunks, then use the helper function to combine the generator's chunks into a single DataFrame.

import pandas as pd

# With chunksize set, read_sql_query returns a generator of DataFrames
df_chunks = pd.read_sql_query(query, conn, chunksize=50000)

def chunks_to_df(gen):
    # Collect the chunks and concatenate them into one DataFrame with a fresh index
    chunks = []
    for df in gen:
        chunks.append(df)
    return pd.concat(chunks, ignore_index=True)

df = chunks_to_df(df_chunks)

This will help you reduce the load on the database server, fetch all your data in batches, and then use it for your further analysis.

  • This doesn't change behavior on the server unless you set `execution_options(stream_results=True)`. See https://docs.sqlalchemy.org/en/13/core/connections.html#sqlalchemy.engine.Connection.execution_options.params.stream_results – rmorshea May 14 '20 at 23:22
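A minimal sketch of the server-side streaming variant that comment describes (the DSN and table name are illustrative, and whether rows are truly streamed still depends on the DB driver):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@dbhost/mydb')

# stream_results=True asks for a server-side cursor, so rows are streamed to
# the client per chunk instead of the whole result set being buffered first.
conn = engine.connect().execution_options(stream_results=True)
chunks = pd.read_sql_query('SELECT * FROM mytable', conn, chunksize=50000)
df = pd.concat(chunks, ignore_index=True)
conn.close()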