17

I'm using MySQLdb and Python. I have some basic queries such as this:

c=db.cursor()
c.execute("SELECT id, rating from video")
results = c.fetchall()

I need "results" to be a NumPy array, and I'm looking to be economical with my memory consumption. It seems like copying the data row by row would be incredibly inefficient (double the memory would be required). Is there a better way to convert MySQLdb query results into the NumPy array format?

The reason I'm looking to use the NumPy array format is because I want to be able to slice and dice the data easily, and it doesn't seem like python is very friendly to multi-dimensional arrays in that regard.

e.g. b = a[a[:,2]==1] 

Thanks!

doug
  • 69,080
  • 24
  • 165
  • 199
thegreatt
  • 1,339
  • 2
  • 12
  • 18

3 Answers3

25

This solution uses Kieth's fromiter technique, but handles the two dimensional table structure of SQL results more intuitively. Also, it improves on Doug's method by avoiding all the reshaping and flattening in python data types. Using a structured array we can read pretty much directly from the MySQL result into numpy, cutting out python data types almost entirely. I say 'almost' because the fetchall iterator still produces python tuples.

There is one caveat though, but it's not a biggie. You must know the data type of your columns and the number of rows in advance.

Knowing the column types should be obvious, since you know what the query is presumably, otherwise you can always use curs.description, and a map of the MySQLdb.FIELD_TYPE.* constants.

Knowing the row count means you have to use client side cursor (which is the default). I don't know enough about the internals of MySQLdb and the MySQL client libraries, but my understanding is that the entire result is fetched into client side memory when using client side cursors, although I suspect there's actually some buffering and caching involved. This would mean using double memory for the result, once for the cursor copy and once for the array copy, so it's probably a good idea to close the cursor as soon as possible to free up the memory if the result set is large.

Strictly speaking, you don't have to provide the number of rows in advance, but doing so means the array memory is allocated once off in advance, and not continuously resized as more rows come in from the iterator which is meant to provide a huge performance boost.

And with that, some code

import MySQLdb
import numpy

conn = MySQLdb.connect(host='localhost', user='bob', passwd='mypasswd', db='bigdb')
curs = conn.cursor() #Use a client side cursor so you can access curs.rowcount
numrows = curs.execute("SELECT id, rating FROM video")

#curs.fetchall() is the iterator as per Kieth's answer
#count=numrows means advance allocation
#dtype='i4,i4' means two columns, both 4 byte (32 bit) integers
A = numpy.fromiter(curs.fetchall(), count=numrows, dtype=('i4,i4'))

print A #output entire array
ids = A['f0'] #ids = an array of the first column
              #(strictly speaking it's a field not column)
ratings = A['f1'] #ratings is an array of the second colum

See the numpy documentation for dtype and the link above about structured arrays for how to specify column data types, and column names.

sirlark
  • 2,187
  • 2
  • 18
  • 28
  • In case someone is looking for a 2D array here instead of a structured array, it's pretty easy to convert it: `ndarray_data = A.view(np.int32).reshape((len(A),-1))` Substitute the best type for all your data. – Jesse Pangburn Apr 19 '17 at 21:23
16

The fetchall method actually returns an iterator, and numpy has the fromiter method to initialize an array from an interator. So, depending on what data is in the table you could combine the two easily, or use an adapter generator.

Keith
  • 42,110
  • 11
  • 57
  • 76
  • 1
    Fromiter only generates a 1-d array object though, right? In this example we'd need a 2-d.. I suppose you could convert it somehow, but in that case would this still be the most efficient method? – thegreatt Aug 16 '11 at 16:19
  • Yes, you can reshape it afterwards. – Keith Aug 16 '11 at 22:04
  • Numpy arrays are very efficient that way. You can set the shape attribute to a tuple `(2,)` and that should work. – Keith Aug 16 '11 at 22:09
  • Hi Keith, thanks for that info - glad to know Numpy can handle these gracefully. Unfortunately I'm struggling with the fromiter() function you recommended.. `results = c.fetchall()` `D = np.fromiter(results, dtype=float, count=-1)` gives `ValueError: setting an array element with a sequence.`. Doesn't seem to matter whether results is 1D or 2D - any ideas? – thegreatt Aug 17 '11 at 04:10
  • Try taking out the "iterable" keyword argument, and making it a positional (first) argument instead. – Keith Aug 17 '11 at 04:16
  • Oh, one other thing, this is what I mean by "adapter". The fetchall returns (iterates over) tuples, even if you only select one column. You will need to pull out element zero from each row. You can use itertools.imap for that. – Keith Aug 17 '11 at 04:18
7

NumPy's fromiter method seems best here (as in Keith's answer, which preceded this one).

Using fromiter to recast a result set, returned by a call to a MySQLdb cursor method, to a NumPy array is simple, but there are a couple of details perhaps worth mentioning.

import numpy as NP
import MySQLdb as SQL

cxn = SQL.connect('localhost', 'some_user', 'their_password', 'db_name')
c = cxn.cursor()
c.execute('SELECT id, ratings from video')

# fetchall() returns a nested tuple (one tuple for each table row)
results = cursor.fetchall()

# 'num_rows' needed to reshape the 1D NumPy array returend by 'fromiter' 
# in other words, to restore original dimensions of the results set
num_rows = int(c.rowcount)

# recast this nested tuple to a python list and flatten it so it's a proper iterable:
x = map(list, list(results))              # change the type
x = sum(x, [])                            # flatten

# D is a 1D NumPy array
D = NP.fromiter(iterable=x, dtype=float, count=-1)  

# 'restore' the original dimensions of the result set:
D = D.reshape(num_rows, -1)

Note that fromiter returns a 1D NumPY array,

(This makes sense, of course, because you can use fromiter to return just a portion of a single MySQL Table row, by passing a parameter for count).

Still, you'll have to restore the 2D shape, hence the predicate call to the cursor method rowcount. and the subsequent call to reshape in the final line.

Finally, the default argument for the parameter count is '-1', which just retrieves the entire iterable

doug
  • 69,080
  • 24
  • 165
  • 199
  • Thanks, I think this is exactly what I was looking for. Although, when I try to run your code, it tells me "TypeError: Required argument 'iter' (pos 1) not found". Does it run for you? `c.execute("SELECT id, rating FROM video")` `results = c.fetchall()` `num_rows = int(c.rowcount)` `D = np.fromiter(iterable=results, dtype=float, count=-1)` `D = D.reshape(num_rows, -1)` – thegreatt Aug 15 '11 at 14:05
  • edited my Answer to include the intermediate steps of recasting and flattening 'results'. To save typing, i didn't include these trivial steps in my original answer instead just stating in a comment line that "'result' is a nested tuple" – doug Aug 17 '11 at 07:07