So I'm trying to get a large amount of data from Postgres (on Heroku) to disk via pandas' CSV-dump functionality, in a low-memory environment (the entire table won't fit in memory).
I had thought that I could simply stream it 100 rows at a time directly into a CSV file with the following code:
import psycopg2  # Postgres driver (used by SQLAlchemy under the hood)
import pandas as pd
from sqlalchemy import create_engine

connuri = "MY_DATABASE_CONNECTION_URL"

# stream_results should give a server-side cursor so rows aren't all pulled into memory at once
engine = create_engine(connuri, execution_options={'stream_results': True})
raw_engine = engine.raw_connection()

sql = "SELECT * FROM giant_table;"

# write the first chunk with a header, then append the rest without one
firstchunk = True
for chunk in pd.read_sql_query(sql, raw_engine, chunksize=100):
    if firstchunk:
        chunk.to_csv("bigtable.csv", index=False)
        firstchunk = False
    else:
        chunk.to_csv("bigtable.csv", mode="a", index=False, header=False)
which is mainly based on this answer as well as this one.
However, it's still running out of memory.
Judging from the traceback, it appears that it's correctly streaming the data, but that it's running out of memory when trying to write to the file, viz:
Traceback (most recent call last):
  File "download_table.py", line 22, in <module>
    chunk.to_csv("bigtable.csv", mode="a", index=False, header=False)
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/core/frame.py", line 1745, in to_csv
    formatter.save()
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/io/formats/csvs.py", line 171, in save
    self._save()
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/io/formats/csvs.py", line 286, in _save
    self._save_chunk(start_i, end_i)
  File "/root/.local/share/virtualenvs/check-heroku-k-KgIKz-/lib/python3.5/site-packages/pandas/io/formats/csvs.py", line 313, in _save_chunk
    self.cols, self.writer)
  File "pandas/_libs/writers.pyx", line 84, in pandas._libs.writers.write_csv_rows
MemoryError
I find this surprising. I would have thought that append mode would just put the cursor at the end of the file without reading the whole thing into memory, and then write the new data at that position. But maybe the entire file needs to be read to do that (?!).
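For what it's worth, my mental model of append mode is something like this toy snippet (nothing pandas-specific, purely illustrative): opening with mode="a" should only seek to the end of the file, no matter how big it already is.

# purely illustrative: appending to a plain file just seeks to the end;
# it shouldn't need to read the existing contents into memory
with open("bigtable.csv", "a") as f:
    f.write("another,row,of,data\n")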
I've tried reducing the chunksize to 10, as well as creating the connection with engine.connect() rather than engine.raw_connection(), in case the problem is that I'm not really streaming data from the database after all. Neither worked.
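In case it matters, the engine.connect() variant looked roughly like this (simplified, with the header handling omitted):

# roughly the engine.connect() variant (simplified; header handling omitted);
# stream_results on the engine should make this use a server-side cursor
with engine.connect() as conn:
    for chunk in pd.read_sql_query(sql, conn, chunksize=10):
        chunk.to_csv("bigtable.csv", mode="a", index=False, header=False)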
I also tried just holding a single file handle open and writing chunk by chunk, as in
firstchunk = True
with open("attempted_download.csv", "w") as csv:
    for chunk in pd.read_sql_query(sql, raw_engine, chunksize=10):
        # render each chunk to a string and write it through the one open handle
        if firstchunk:
            mystring = chunk.to_csv(index=False)
            csv.write(mystring)
            firstchunk = False
        else:
            mystring = chunk.to_csv(index=False, header=False)
            csv.write(mystring)
but I got the same memory error. Am I missing something obvious here?
Edit
I also just tried to save to a bunch of separate files, viz.:
def countermaker():
    count = 0
    def counter():
        # each call returns the next partial_csv<N>.csv filename
        nonlocal count
        count += 1
        return "partial_csv{}.csv".format(count)
    return counter

counter = countermaker()

firstchunk = True
for chunk in pd.read_sql_query(sql, raw_engine, chunksize=10):
    if firstchunk:
        chunk.to_csv(counter(), index=False)
        firstchunk = False
    else:
        chunk.to_csv(counter(), index=False, header=False)
and got exactly the same error, although it did manage to create some 578 files that way.