
I have a SQL Server database table in Amazon RDS, and I am running a Python script on an 8 GB EC2 instance in AWS. The Python code simply selects all the data from a large table and tries to write it to a CSV file on the EC2 instance. The instance quickly runs out of memory, even when I try to extract the data one year at a time; I would like all the data extracted into a CSV (I don't necessarily need to use Pandas).

As of now, the Pandas DataFrame code is very simple:

import pandas as pd

query = 'select * from table_name'
df = pd.read_sql(query, cnxn)          # cnxn is the existing database connection
df.to_csv(target_name, index=False)

The error I am seeing is:

Traceback (most recent call last):
  df = pd.read_sql(query, cnxn)
MemoryError

tawab_shakeel
EngineJanwaar

2 Answers


You'll want to use your SQL database's native management tools instead of Python/Pandas here.

  • If it's a MySQL database,
    mysql ... --batch --execute='select * from table_name' > my-file.csv
  • If it's a PostgreSQL database, within psql do something like
    \copy (select * from table_name) To './my-file.csv' With CSV
  • If it's SQL Server (via here),
    sqlcmd -S MyServer -d myDB -E -Q "select * from table_name" -o "my-file.csv" -h-1 -s"," -w 700
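
If you need to kick the export off from the same Python script, you can shell out to one of these tools rather than pulling anything into Python's memory. A minimal sketch, assuming sqlcmd is installed on the EC2 instance and reusing the placeholder server, database, and file names from the command above:

import subprocess

# Stream the table straight to a CSV file via sqlcmd; no rows are held in Python memory.
subprocess.run(
    [
        'sqlcmd',
        '-S', 'MyServer',                  # placeholder server name
        '-d', 'myDB',                      # placeholder database name
        '-E',                              # integrated auth; use -U/-P instead for SQL logins
        '-Q', 'select * from table_name',
        '-o', 'my-file.csv',
        '-h-1',                            # suppress the header row
        '-s,',                             # comma separator
        '-w', '700',                       # output line width
    ],
    check=True,
)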

If you really do want to use Pandas though, you might be able to get away with the chunksize parameter (adjust accordingly if you're running out of memory):

with open('my_csv.csv', 'w') as f:
    for i, partial_df in enumerate(pd.read_sql(query, cnxn, chunksize=100000)):
        print('Writing chunk %s' % i)
        partial_df.to_csv(f, index=False, header=(i == 0))
AKX
  • The solution without Pandas seems the right way. I've read some community posts, and it seems read_sql fetches the entire result set into memory even with chunksize, which is a problem: as soon as my big table is loaded into memory, I get the MemoryError. Maybe this has something to do with my DB driver, I'm not sure. But thank you for sqlcmd, that came in handy. Source: https://gitter.im/pydata/pandas?at=554609295edd84254582fb39 Source: https://github.com/pandas-dev/pandas/issues/10693 To be fair, I'm surprised the keyword is called chunksize but doesn't behave like it. – EngineJanwaar Jul 08 '19 at 06:14

If you can use read_sql with Pandas, you certainly have a driver that lets you query the database directly through a DB-API 2.0 interface, so you can write the result with the csv module one record at a time:

import csv

con = ...   # depends on your current driver (e.g. pyodbc for SQL Server)
curs = con.cursor()
curs.execute('select * from table_name')
with open(target_name, 'w', newline='') as fd:
    wr = csv.writer(fd)
    wr.writerow([d[0] for d in curs.description])    # write the header line
    while True:                                      # loop on the cursor
        row = curs.fetchone()
        if not row:                                  # until the end of rows
            break
        wr.writerow(row)                             # and write the row
curs.close()
con.close()
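
If fetching one row at a time turns out to be too slow for a table with a few million rows, a minimal variant of the same idea fetches bounded batches with fetchmany and writes them with writerows; the 10,000-row batch size below is an arbitrary assumption you can tune to your memory budget:

import csv

con = ...   # same DB-API 2.0 connection as above
curs = con.cursor()
curs.execute('select * from table_name')
with open(target_name, 'w', newline='') as fd:
    wr = csv.writer(fd)
    wr.writerow([d[0] for d in curs.description])   # write the header line
    while True:
        rows = curs.fetchmany(10000)                 # pull a bounded batch of rows
        if not rows:
            break
        wr.writerows(rows)                           # write the whole batch at once
curs.close()
con.close()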
Serge Ballesta
  • Thank you for this solution; however, since my table is so big, this seems like it will take too much time and consume too many server resources writing a few million lines. – EngineJanwaar Jul 08 '19 at 06:05
  • @RakeshGuha: Why do you think that? Whichever way you do it, the system has to write the data, and any decent system (Python included) optimizes sequential line writes to the point where they are close to raw access speed. And since this solution uses little memory, the system can use more for its own buffering... – Serge Ballesta Jul 08 '19 at 07:45
  • Thank you, I will try this, check the system stats, and let you know. I don't have a sandbox environment, which is why I didn't want to take a chance; I will run this alongside vmstat to see which option is more suitable for me. – EngineJanwaar Jul 08 '19 at 09:00