
I have a SQL Server database table in Amazon RDS, and I am running a Python script on an 8 GB EC2 instance in AWS. The Python code simply selects all the data from a large table and tries to write it to a CSV file on the EC2 instance. The instance quickly runs out of memory, even when I try to extract the data one year at a time; I would like all the data extracted into a CSV (I don't necessarily need to use Pandas).

As of now, the Pandas DataFrame code is very simple:

import pandas as pd

query = 'select * from table_name'
df = pd.read_sql(query, cnxn)          # cnxn is the existing database connection
df.to_csv(target_name, index=False)

The error I am seeing is:

Traceback (most recent call last):
  df = pd.read_sql(query, cnxn)
MemoryError

tawab_shakeel
EngineJanwaar

2 Answers


You'll want to use your SQL database's native management tools instead of Python/Pandas here.

  • If it's a MySQL database,
    mysql ... --batch --execute='select * from table_name' > my-file.csv
  • If it's a PostgreSQL database, within psql do something like
    \copy (select * from table_name) To './my-file.csv' With CSV
  • If it's SQL Server (via here),
    sqlcmd -S MyServer -d myDB -E -Q "select * from table_name" -o "my-file.csv" -h-1 -s"," -w 700
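
If you need to kick the export off from the same Python script, you can shell out to one of these tools rather than pulling anything into Python's memory. A minimal sketch, assuming sqlcmd is installed on the EC2 instance and reusing the placeholder server, database, and file names from the command above:

import subprocess

# Stream the table straight to a CSV file via sqlcmd; no rows are held in Python memory.
subprocess.run(
    [
        'sqlcmd',
        '-S', 'MyServer',                  # placeholder server name
        '-d', 'myDB',                      # placeholder database name
        '-E',                              # integrated auth; use -U/-P instead for SQL logins
        '-Q', 'select * from table_name',
        '-o', 'my-file.csv',
        '-h-1',                            # suppress the header row
        '-s,',                             # comma separator
        '-w', '700',                       # output line width
    ],
    check=True,
)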

If you really do want to use Pandas though, you might be able to get away with the chunksize parameter (adjust accordingly if you're running out of memory):

with open('my_csv.csv', 'w') as f:
    for i, partial_df in enumerate(pd.read_sql(query, cnxn, chunksize=100000)):
        print('Writing chunk %s' % i)
        partial_df.to_csv(f, index=False, header=(i == 0))
AKX
  • The solution without Pandas seems the right way. I've read some community posts, and it seems read_sql fetches the entire result set into memory even with chunksize, which is a problem: as soon as my big table is loaded into memory, I get the MemoryError. Maybe this has something to do with my DB driver, I'm not sure. But thank you for sqlcmd, that came in handy. Source: https://gitter.im/pydata/pandas?at=554609295edd84254582fb39 Source: https://github.com/pandas-dev/pandas/issues/10693 To be fair, I'm surprised the keyword is called chunksize but doesn't behave like it. – EngineJanwaar Jul 08 '19 at 06:14

If you can use read_sql with Pandas, you certainly have a driver that lets you query the database directly through a DB-API 2.0 interface, so you can write the result with the csv module one record at a time:

import csv

con = ...   # depends on your current driver (e.g. pyodbc for SQL Server)
curs = con.cursor()
curs.execute('select * from table_name')
with open(target_name, 'w', newline='') as fd:
    wr = csv.writer(fd)
    wr.writerow([d[0] for d in curs.description])    # write the header line
    while True:                                      # loop on the cursor
        row = curs.fetchone()
        if not row:                                  # until the end of rows
            break
        wr.writerow(row)                             # and write the row
curs.close()
con.close()
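
If fetching one row at a time turns out to be too slow for a table with a few million rows, a minimal variant of the same idea fetches bounded batches with fetchmany and writes them with writerows; the 10,000-row batch size below is an arbitrary assumption you can tune to your memory budget:

import csv

con = ...   # same DB-API 2.0 connection as above
curs = con.cursor()
curs.execute('select * from table_name')
with open(target_name, 'w', newline='') as fd:
    wr = csv.writer(fd)
    wr.writerow([d[0] for d in curs.description])   # write the header line
    while True:
        rows = curs.fetchmany(10000)                 # pull a bounded batch of rows
        if not rows:
            break
        wr.writerows(rows)                           # write the whole batch at once
curs.close()
con.close()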
Serge Ballesta
  • Thank you for this solution; however, since my table is so big, this seems like it will take too much time and consume too many server resources writing a few million lines. – EngineJanwaar Jul 08 '19 at 06:05
  • @RakeshGuha: Why do you think that? Whichever way you do it, the system has to write the data, and any decent system (Python included) optimizes sequential line writes to the point where they are close to raw access speed. And since this solution uses little memory, the system can use more for its own buffering... – Serge Ballesta Jul 08 '19 at 07:45
  • Thank you, I will try this, check the system stats, and let you know. I don't have a sandbox environment, which is why I didn't want to take a chance; I will run this alongside vmstat to see which option is more suitable for me. – EngineJanwaar Jul 08 '19 at 09:00