
I am trying to read a very large MySQL table made of several million rows. I used the Pandas library with chunks. See the code below:

import pandas as pd
import numpy as np
import pymysql.cursors

connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')

try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM example_table;"

        chunks=[]

        for chunk in pd.read_sql(query, connection, chunksize = 1000):
            chunks.append(chunk)
        #print(len(chunks))    
        result = pd.concat(chunks, ignore_index=True)
        #print(type(result))
        #print(result)

finally:
    print("Done!")

    connection.close()

The execution time is acceptable if I limit the number of rows to select. But if I select even a modest amount of data (for example, 1 million rows), the execution time increases dramatically.

Is there a better/faster way to select the data from a relational database within Python?

UgoL
  • https://stackoverflow.com/questions/34180448/mysql-retrieve-a-large-select-by-chunks this looks similar to your problem – user7422128 Jul 03 '18 at 10:29

3 Answers


Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.

Without knowing much about pandas' internal chunking, I think you would have to do the chunking manually (how you split it depends on the data)... Don't use LIMIT / OFFSET for this; performance would be terrible.

This might not be a good idea, depending on the data. If there is a useful way to split up the query (e.g. if it's a timeseries, or there is some kind of appropriate index column to use), it might make sense. I've put in two examples below to show different cases.

Example 1

import multiprocessing

import pandas as pd
import MySQLdb

def worker(y):
    # y is a value in an indexed column, e.g. a category;
    # each worker opens its own connection, since connections can't be shared across processes
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE col_x = %s"
    return pd.read_sql(query, connection, params=(y,))

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)

data = p.map(worker, [y for y in col_x_categories])
# assuming there is a reasonable number of categories in an indexed col_x

p.close()
results = pd.concat(data)

Example 2

import datetime
import multiprocessing

import pandas as pd
import MySQLdb

def worker(a, b):
    # a and b are timestamps bounding one chunk of rows
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE x >= %s AND x < %s"
    return pd.read_sql(query, connection, params=(a, b))

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)

date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
# this is arbitrary and will depend on your data / knowing your data beforehand
# (i.e. d1, d2 and an appropriate freq to use)

date_pairs = list(zip(date_range, date_range[1:]))
data = p.starmap(worker, date_pairs)
# starmap unpacks each (start, end) pair into worker's two arguments

p.close()
results = pd.concat(data)

There are probably nicer ways of doing this (and I haven't properly tested it, etc.). I'd be interested to know how it goes if you try it.

djmac
  • I don't think I've understood exactly how these examples work... let's take the first one. You map and call the function worker for each item of the list col_x_categories. But where do we define y and col_x_categories? And I suppose that this map operation is parallelized across n processes. – UgoL Jul 03 '18 at 17:19
  • Is it doable with Vaex? – SteveS Mar 16 '22 at 11:40

You could try using a different MySQL connector. I would recommend trying mysqlclient, which is the fastest MySQL connector (by a considerable margin, I believe).

pymysql is a pure-Python MySQL client, whereas mysqlclient is a wrapper around the (much faster) C libraries.

Usage is basically the same as pymysql:

import MySQLdb

connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')

Read more about the different connectors here: What's the difference between MySQLdb, mysqlclient and MySQL connector/Python?
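
For instance, here is a minimal, untested sketch of the chunked read from the question with only the connector swapped out (same placeholder credentials and example_table as above):

import pandas as pd
import MySQLdb

# same placeholder credentials and table as in the question
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')

try:
    chunks = []
    # pd.read_sql is called the same way as with the pymysql connection in the question
    for chunk in pd.read_sql("SELECT * FROM example_table;", connection, chunksize=1000):
        chunks.append(chunk)
    result = pd.concat(chunks, ignore_index=True)
finally:
    connection.close()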

djmac
  • Thanks @djmac for this suggestion. I'll give it a try. What else can I take into consideration to speed up the reading process in general? Maybe this chunks solution is not so well optimized? – UgoL Jul 03 '18 at 11:43
  • Honestly, I think it mainly depends *a lot* on the database structure (and the data). If you really are doing a `SELECT *` on a whole table, maybe a different data store (i.e. not MySQL) could be better? Otherwise, making sure you have the right indexes can make a big difference (the `EXPLAIN` statement can help you out here and provide guidance on good tweaks) – djmac Jul 03 '18 at 12:25
  • Okay. I am not so familiar with the internals of `pandas` - but the MySQL-python connector (on which mysqlclient is based) is about 1000% faster than pymysql (according to this https://wiki.openstack.org/wiki/PyMySQL_evaluation). – djmac Jul 03 '18 at 12:36
  • Another option would be to consider parallelisation (e.g. use `multiprocessing` module and a pool of workers). For example, you could have multiple workers selecting smaller chunks and stitching them together with `pd.concat`. You could even shard (e.g. vertically partition) the database and get even better performance - though if the bottleneck is `pandas` that might not be as good bang for your buck. – djmac Jul 03 '18 at 12:42
  • Thank you for the last suggestions... Can you show me a little example of a multiprocessing application based on my actual experimental code? – UgoL Jul 03 '18 at 12:45
  • Sure - will put it as another answer – djmac Jul 03 '18 at 13:19

For those using Windows and having trouble installing MySQLdb: I use the following approach to fetch data from a huge table.

import mysql.connector

connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

i = 0          # LIMIT offsets are zero-based, so start at 0 to include the first row
limit = 1000

while True:
    sql = "SELECT * FROM super_table LIMIT {}, {}".format(i, limit)
    cursor.execute(sql)
    rows = cursor.fetchall()

    if not rows:       # break the loop when no more rows are returned
        print("Done!")
        break

    for row in rows:   # do something with the results
        print(row)

    i += limit

connection.close()
Yoannes Geissler
  • It will be slow later because of `LIMIT`: with a large offset, MySQL has to scan and discard all of the skipped rows. A deferred join is a better solution, like this: `SELECT id, xValue, yValue, zValue, createdAt FROM (SELECT id FROM Position LIMIT 2000000, 1000) q JOIN Position p ON p.id = q.id` – han058 Apr 06 '20 at 00:48
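
For reference, a rough, untested sketch of that deferred-join idea in Python, reusing mysql.connector from the answer above (the Position table and its xValue/yValue/zValue/createdAt columns come from the comment and stand in for your own schema; ORDER BY id is added so the chunks are deterministic):

import mysql.connector

connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

offset = 0
limit = 1000

while True:
    # select only the ids via the primary-key index first, then join back for the full rows,
    # so MySQL does not have to scan and discard `offset` full rows on every iteration
    sql = ("SELECT p.id, p.xValue, p.yValue, p.zValue, p.createdAt "
           "FROM (SELECT id FROM Position ORDER BY id LIMIT %s, %s) q "
           "JOIN Position p ON p.id = q.id")
    cursor.execute(sql, (offset, limit))
    rows = cursor.fetchall()

    if not rows:       # no more rows to fetch
        break

    for row in rows:   # do something with the results
        print(row)

    offset += limit

connection.close()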