
I have to carry out some statistical analyses on data that is stored in PostgreSQL tables. I have been hesitating between using R and Python.

With R I use the following code:

require("RPostgreSQL")
(...) #connection to the database, etc
my_table <- dbGetQuery(con, "SELECT * FROM some_table;")

which is very fast: it takes only about 5 seconds to fetch a table with ~200,000 rows and 15 columns, with almost no NULLs in it.

With Python, I use the following code:

import psycopg2
conn = psycopg2.connect(conn_string)  # conn_string is defined elsewhere
cursor = conn.cursor()
cursor.execute("SELECT * FROM some_table;")
my_table = cursor.fetchall()  # loads the whole result set into client memory

and surprisingly, it causes my Python session to freeze and my computer to crash.

As I use these libraries as "black boxes", I don't understand why something that is so quick in R can be that slow (and thus almost impossible to use in practice) in Python.

Can someone explain this difference in performance, and tell me if there is a more efficient way to fetch a PostgreSQL table in Python?

R. Bourgeon
  • Isn't `dbGetQuery` similar to `execute` in the sense that no results are fetched when that line is executed? – Guybrush Apr 11 '18 at 08:11
  • This looks like a pretty simple psycopg2 example doing something it should have no problems with. If making this query takes down your whole computer, then I'd start to wonder if your Python or psycopg2 install is corrupted in some way. How does `fetchone()` behave? – match Apr 11 '18 at 08:12
  • @GuyBrush that way of working is done by using `dbSendQuery` and `fetch` instead. – match Apr 11 '18 at 08:13
  • @Guybrush I honestly don't understand your comment. @match I tried again with a SQL query that limits to 20 lines (it's MySQL and not PostgreSQL here, but I don't think that matters much). `fetchall()` works like a charm with the 20-line table, but `fetchone()` returns the following error: `mysql.connector.errors.InternalError: Unread result found` – R. Bourgeon Apr 11 '18 at 08:17

1 Answer


I am no expert in R, but quite obviously what dbGetQuery() (actually: what dbFetch()) returns is a lazy object that does not load all results into memory, otherwise it would of course take ages and eat all your RAM too.

Regarding Python / psycopg, you definitely DON'T want to fetchall() a huge dataset. The proper solution here is to use a server-side cursor and iterate over it, as in the sketch below.
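
For illustration, here is a minimal sketch of that approach, reusing the conn_string from the question; process() is a hypothetical placeholder for whatever you do with each row:

import psycopg2

conn = psycopg2.connect(conn_string)  # conn_string as in the question

# Giving the cursor a name makes psycopg2 create a server-side cursor,
# so rows are streamed from the server instead of being loaded all at once.
cursor = conn.cursor('my_server_side_cursor')
cursor.itersize = 2000  # rows fetched per network round trip while iterating
cursor.execute("SELECT * FROM some_table;")

for row in cursor:   # iterates without materialising the whole table in memory
    process(row)     # hypothetical per-row processing

cursor.close()
conn.close()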

Edit - answering the questions in your comments:

so the option `cursor_factory=psycopg2.extras.DictCursor` when executing `fetchall()` does the trick, right?

Not at all. As spelled out in the example I linked to, what "does the trick" is using a server-side cursor, which is done (in psycopg) by naming the cursor:

HERE IS THE IMPORTANT PART, by specifying a name for the cursor psycopg2 creates a server-side cursor, which prevents all of the records from being downloaded at once from the server.

cursor = conn.cursor('cursor_unique_name')

The DictCursor stuff is actually irrelevant (and should not be mentioned in this example since it obviously confuses newcomers).
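
If it helps to see the distinction, here is a minimal sketch (assuming psycopg2.extras is imported and a hypothetical column name): DictCursor only changes how each row is represented, not where the result set lives.

import psycopg2
import psycopg2.extras

conn = psycopg2.connect(conn_string)  # conn_string as in the question

# cursor_factory only affects the row format (dict-like rows instead of tuples);
# without a name this is still an ordinary client-side cursor.
cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
cursor.execute("SELECT * FROM some_table LIMIT 5;")
for row in cursor:
    print(row["some_column"])  # "some_column" is a hypothetical column name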

I have a side question regarding the concept of a lazy object (the one returned in R). How is it possible to return the object as a data frame without storing it in my RAM? I find it a bit magical.

As I mentioned, I know zilch about R and its implementation (I deduce that whatever dbFetch returns is a lazy object from the behaviour you describe), but there's nothing magical about an object that lazily fetches values from an external source. Python's file object is a well-known example:

with open("/some/huge/file.txt") as f:
    for line in f:   # lines are read from disk lazily, one at a time
        print(line)

In the above snippet, the file object f fetches data from disk only when needed. All that needs to be stored is the file pointer position (and a buffer of the last N bytes that were read from disk, but that's an implementation detail).
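
The same kind of laziness can be written by hand with a generator. Here is a minimal, hypothetical sketch built on the standard DB-API fetchmany() call; paired with a named (server-side) cursor like the one above, it keeps at most batch_size rows in memory at any time:

def lazy_rows(cursor, batch_size=1000):
    # Yield rows one at a time, pulling them from the cursor
    # in batches of batch_size instead of all at once.
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield row

# usage: iterate lazily instead of calling fetchall()
# for row in lazy_rows(cursor):
#     process(row)  # hypothetical per-row processing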

If you want to learn more, read about Python's iterables and iterators.

bruno desthuilliers
  • so the option `cursor_factory=psycopg2.extras.DictCursor` when executing `fetchall()` does the trick, right? It's weird because I have already done things like this in VBA and I am 100% sure that I was using a _client_-side cursor at the time. – R. Bourgeon Apr 11 '18 at 11:55
  • I have a side question regarding the concept of a lazy object (the one returned in R). How is it possible to return the object as a data frame without storing it in my RAM? I find it a bit magical. – R. Bourgeon Apr 11 '18 at 11:57