I'm trying to get a large amount of data from Postgres (on Heroku) to disk via pandas' CSV-dump functionality, in a low-memory environment (the entire table won't fit in memory).

I had thought that I could simply stream it 100 rows at a time directly into a CSV file with the following code:

import psycopg2
import pandas as pd
from sqlalchemy import create_engine

connuri = "MY_DATABASE_CONNECTION_URL"

# stream_results is supposed to make SQLAlchemy use a server-side cursor
engine = create_engine(connuri, execution_options={'stream_results': True})
raw_engine = engine.raw_connection()

sql = "SELECT * FROM giant_table;"

firstchunk = True

for chunk in pd.read_sql_query(sql, raw_engine, chunksize=100):
    if firstchunk:
        # first chunk: create the file and write the header
        chunk.to_csv("bigtable.csv", index=False)
        firstchunk = False
    else:
        # later chunks: append without repeating the header
        chunk.to_csv("bigtable.csv", mode="a", index=False, header=False)

which is mainly based on this answer as well as this one

However, it's still running out of memory.

Judging from the traceback, it appears that it's correctly streaming the data, but that it's running out of memory when trying to write to the file, viz:

Traceback (most recent call last):
  File "download_table.py", line 22, in <module>
    chunk.to_csv("bigtable.csv", mode="a", index=False, header=False)
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/core/frame.py", line 1745, in to_csv
    formatter.save()
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/io/formats/csvs.py", line 171, in save
    self._save()
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/io/formats/csvs.py", line 286, in _save
    self._save_chunk(start_i, end_i)
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/io/formats/csvs.py", line 313, in _save_chunk
    self.cols, self.writer)
  File "pandas/_libs/writers.pyx", line 84, in pandas._libs.writers.write_csv_rows
MemoryError

I find this surprising. I would have thought that append mode would just stick a cursor at the end of the file without reading the whole thing into memory, then plunk the data down where the cursor is. But maybe the entire file needs to be read to do that (?!).
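
As a sanity check, something like this (my own sketch) suggests that append mode really does just seek to the end rather than reading the file:

import os

# Opening in binary append mode positions the write cursor at EOF without
# reading any of the existing contents, so the position right after open
# should equal the file's current size.
with open("bigtable.csv", "ab") as f:
    assert f.tell() == os.path.getsize("bigtable.csv")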

I've tried reducing the chunksize to 10, as well as creating the connection with engine.connect() rather than engine.raw_connection(), in case the problem is that I'm not really streaming data from the db after all. Neither worked.
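
One thing I can't rule out is that stream_results only takes effect when executing through SQLAlchemy itself; going through raw_connection(), pandas may be getting a plain client-side psycopg2 cursor that fetches the entire result set before chunking. An untested sketch of forcing a server-side cursor directly (the cursor name giant_cur and the crude CSV writing are my own placeholders):

import csv
import psycopg2

conn = psycopg2.connect(connuri)
cur = conn.cursor(name="giant_cur")  # naming the cursor makes psycopg2 create it server-side
cur.itersize = 100                   # rows pulled from the server per round trip

cur.execute("SELECT * FROM giant_table;")
with open("bigtable.csv", "w", newline="") as f:
    writer = csv.writer(f)
    first = cur.fetchone()
    if first is not None:
        # on a named cursor, description is only populated after the first fetch
        writer.writerow(col[0] for col in cur.description)
        writer.writerow(first)
        writer.writerows(cur)        # iterates the rest in itersize batches

cur.close()
conn.close()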

I also tried just holding a single file handle open and writing chunk by chunk, as in

firstchunk = True  # reset, since the earlier loop left it False

with open("attempted_download.csv", "w") as csv:
    for chunk in pd.read_sql_query(sql, raw_engine, chunksize=10):
        if firstchunk:
            mystring = chunk.to_csv(index=False)  # include header once
            csv.write(mystring)
            firstchunk = False
        else:
            mystring = chunk.to_csv(index=False, header=False)
            csv.write(mystring)

but I got the same memory error. Am I missing something obvious here?

Edit

I also just tried to save to a bunch of separate files, viz.:

def countermaker():
    # returns a closure that generates sequential filenames:
    # partial_csv1.csv, partial_csv2.csv, ...
    count = 0
    def counter():
        nonlocal count
        count += 1
        return "partial_csv{}.csv".format(count)
    return counter

counter = countermaker()
firstchunk = True

for chunk in pd.read_sql_query(sql, raw_engine, chunksize=10):
    if firstchunk:
        chunk.to_csv(counter(), index=False)
        firstchunk = False
    else:
        chunk.to_csv(counter(), index=False, header=False)

and got exactly the same error, although it did manage to create some 578 files that way.

  • you could try flushing. https://stackoverflow.com/a/5255743/435089 – Kannappan Sirchabesan Jan 20 '19 at 22:09
  • Why on earth use pandas for this? You can simply copy the table with `COPY` command, see [Exporting a PostgreSQL query to a csv file using Python](https://stackoverflow.com/q/49610908/1995738) for example. – klin Jan 20 '19 at 22:19
  • Huh, @klin that... just worked. I was imagining terrible problems with date objects or streaming or something for using postgres directly, but I was wrong. Thanks. (Now to figure out what postgres did to my date objects, multiline string fields, etc...) – Paul Gowder Jan 20 '19 at 22:38
  • 1
    The command guarantee that 1. you'll get a properly formatted CSV, 2. you can read the file with the same command (COPY FROM) – klin Jan 20 '19 at 22:51
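
Edit 2

For posterity, a rough sketch of the COPY approach klin pointed to (adapted from the linked answer; treat the details as assumptions rather than a drop-in solution):

import psycopg2

conn = psycopg2.connect(connuri)
cur = conn.cursor()

# Postgres formats the CSV itself and streams it to the file object,
# so nothing larger than a buffer is ever held in memory.
with open("bigtable.csv", "w") as f:
    cur.copy_expert("COPY giant_table TO STDOUT WITH CSV HEADER", f)

cur.close()
conn.close()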
