
I have to deal with a large result set (could be hundreds of thousands of rows, sometimes more).
Unfortunately, they all need to be retrieved at once (on startup).

I'm trying to do that using as little memory as possible.
By looking on SO I've found that using SSCursor might be what I'm looking for, but I still don't really know how to use one.

Is doing a fetchall() from a base cursor the same as from an SSCursor (in terms of memory usage)?

Can I 'stream' rows from the SSCursor one by one (or a few at a time), and if so, what is the most efficient way to do so?

Sylvain

3 Answers


I agree with Otto Allmendinger's answer, but to make Denis Otkidach's comment explicit, here is how you can iterate over the results without using Otto's fetch() function:

import MySQLdb.cursors

connection = MySQLdb.connect(
    host="thehost", user="theuser",
    passwd="thepassword", db="thedb",
    cursorclass=MySQLdb.cursors.SSCursor)  # server-side cursor: rows stay on the MySQL server
cursor = connection.cursor()
cursor.execute(query)
for row in cursor:  # each row is fetched from the server as you iterate
    print(row)
unutbu
  • Does this fetch rows one by one? If not, how many rows does it fetch in one go? I have a db with over 37 million records. I need to read the records one by one and put them in a file (with some additional things, so it cannot be a simple dump). Can this be done in parallel somehow? For example, I fetch 10000 rows and write them, and while I write them some more rows are being fetched, and so on. – Sohaib Sep 01 '14 at 08:49
  • @Sohaib: This fetches one row at a time. If your problem is CPU-bound, then you could use multiprocessing (Python2/3) or concurrent.futures (in Python3) to set up multiple DB readers, but you should only use one writer to the file, or else the records will get garbled. [If your problem is IO-bound](http://eli.thegreenplace.net/2012/01/16/python-parallelizing-cpu-bound-tasks-with-multiprocessing/) -- if writing to the file is the bottleneck -- then setting up multiple readers will not speed up the job. – unutbu Sep 01 '14 at 11:22
  • Wouldn't `cursor.arraysize` set the number of records to fetch per request? Not sure if it only applies to `fetchmany` and not `fetchone`, but it should be handy anyway if supported by `MySQLdb`. – sirfz Dec 16 '16 at 16:01
  • How do you close the connection on this? `connection.close()`? – Adders Aug 25 '18 at 23:12
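
Following up on the batching discussion in the comments above, here is a rough sketch of fetching in chunks with fetchmany() instead of one row at a time (same connection parameters as in the answer; process() is a hypothetical placeholder for your per-row work or file writing):

import MySQLdb
import MySQLdb.cursors

connection = MySQLdb.connect(
    host="thehost", user="theuser",
    passwd="thepassword", db="thedb",
    cursorclass=MySQLdb.cursors.SSCursor)
cursor = connection.cursor()
cursor.execute(query)

while True:
    rows = cursor.fetchmany(10000)  # returns up to 10,000 rows at a time; the SSCursor still streams them from the server
    if not rows:
        break                       # an empty batch means the result set is exhausted
    for row in rows:
        process(row)                # hypothetical placeholder for your own handling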

Definitely use the SSCursor when fetching big result sets. It made a huge difference for me when I had a similar problem. You can use it like this:

import MySQLdb
import MySQLdb.cursors

connection = MySQLdb.connect(
        host=host, port=port, user=username, passwd=password, db=database, 
        cursorclass=MySQLdb.cursors.SSCursor) # put the cursorclass here
cursor = connection.cursor()

Now you can execute your query with cursor.execute() and use the cursor as an iterator.
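
For example (assuming query holds the SELECT statement you want to stream):

cursor.execute(query)
for row in cursor:  # rows are fetched from the server as you iterate
    print(row)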

Edit: removed unnecessary homegrown iterator, thanks Denis!

Otto Allmendinger

Alternatively, you can use SSCursor without setting it on the connection object (this is handy when you have already defined the connection and don't want every cursor on that connection to use SSCursor as its cursor class).

import MySQLdb
from MySQLdb.cursors import SSCursor # or you can use SSDictCursor

connection = MySQLdb.connect(
        host=host, port=port, user=username, passwd=password, db=database)
cursor = SSCursor(connection)  # only this cursor is server-side; other cursors on the connection are unaffected
cursor.execute(query)
for row in cursor:
    print(row)
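
The same pattern works with SSDictCursor if you want rows as dictionaries. A rough sketch, which also shows the explicit cleanup asked about in the comments above (closing the cursor before the connection so the server-side result set is released first):

import MySQLdb
from MySQLdb.cursors import SSDictCursor

connection = MySQLdb.connect(
        host=host, port=port, user=username, passwd=password, db=database)
cursor = SSDictCursor(connection)  # rows come back as dicts keyed by column name
cursor.execute(query)
try:
    for row in cursor:
        print(row)
finally:
    cursor.close()       # finish the server-side result set first
    connection.close()   # then close the connection itself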
Yuda Prawira