100

I am exploring switching to Python and pandas as a long-time SAS user.

However, when running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB CSV file. It had about 200,000 rows and 200 columns of mostly numeric data.

With SAS, I can import a CSV file into a SAS dataset, and it can be as large as my hard drive.

Is there something analogous in pandas?

I regularly work with large files and do not have access to a distributed computing network.

Zelazny7
  • I'm not familiar with pandas, but you might want to look into iterating through the file. http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk – monkut Jul 24 '12 at 01:02

6 Answers

89

Wes is of course right! I'm just chiming in to provide slightly more complete example code. I had the same issue with a 129 MB file, which was solved by:

import pandas as pd

# read_csv with iterator=True returns a TextFileReader, which yields
# DataFrame chunks of 1000 rows at a time.
csv_iterator = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)

# Iterate through the chunks and print one row/record at a time
for chunk in csv_iterator:
    for index, row in chunk.iterrows():
        print(row)

# The loop above exhausts the iterator, so create a fresh one before
# concatenating all chunks into a single DataFrame.
csv_iterator = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)
df = pd.concat(csv_iterator, ignore_index=True)
# On pandas < 0.14.0, use pd.concat(list(csv_iterator), ignore_index=True)
fickludd
  • 6
    I think you can just do `df = pd.concat(tp, ignore_index=True)`? – Andy Hayden Jun 24 '13 at 12:24
  • @smci Tried this quickly with the same data repeated x4 (550 Mb) or x8 (1.1Gb). Interestingly, with or without [x for x in tp], the x4 went through fine, and x8 crashed in a MemoryError. – fickludd Oct 15 '13 at 12:26
  • @fickludd Are you using ipython / is the dataframe persisting in memory rather than being garbage collected? – Andy Hayden Oct 15 '13 at 21:10
  • @AndyHayden Python 2.7.3 (default, Aug 1 2012, 05:14:39) [GCC 4.6.3] on linux2. How could the data be garbage collected - I still have it stored in a dataframe? – fickludd Oct 16 '13 at 06:06
  • @fickludd Sorry not sure I understand the issue then :) – Andy Hayden Oct 16 '13 at 15:53
  • 3
    I get this error while using it: `AssertionError: first argument must be a list-like of pandas objects, you passed an object of type "TextFileReader"`. Any idea what is happening here? – Prince Kumar Feb 28 '14 at 23:02
  • 3
    This bug will be fixed in 0.14 (release soon), https://github.com/pydata/pandas/pull/6941; workaround for < 0.14.0 is to do ``pd.concat(list(tp), ignore_index=True)`` – Jeff Apr 23 '14 at 16:02
  • 1
    What if the values are strings or categorical? I am getting the error: incompatible categories in categorical concat – As3adTintin Jun 22 '15 at 17:26
  • This was really really helpful. Thanks. Was stuck on it for quite some time. – Run2 Jan 12 '16 at 05:27
  • why does `ignore_index` need to be true? – SARose Mar 31 '17 at 04:35
  • Hi, may I also request a solution for a similar problem that I am facing: https://stackoverflow.com/questions/46490474/memory-error-performing-sentiment-analysis-large-size-data – Sitz Blogz Oct 01 '17 at 18:02
84

In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).

At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or memory-mapped file (np.memmap)), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000), then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
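For illustration, here is a minimal sketch of the tedious pre-allocation workaround, assuming the row and column counts from the question are known in advance and every column is numeric (the file name is just a placeholder):

import csv
import numpy as np
import pandas as pd

# Assumed dimensions (taken from the question); in practice you would
# determine these with a quick first pass over the file.
n_rows, n_cols = 200000, 200

data = np.empty((n_rows, n_cols), dtype=np.float64)

with open('large_dataset.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for i, row in enumerate(reader):
        data[i] = np.asarray(row, dtype=np.float64)

df = pd.DataFrame(data)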

Wes McKinney
  • 1
    Say I can read the file and concat all of them together into one DataFrame. Does the DataFrame have to reside in memory? With SAS, I can work with datasets of any size as long as I have the hard-drive space. Is it the same with DataFrames? I get the impression they are constrained by RAM and not hard-drive space. Sorry for the noob question and thanks for your help. I'm enjoying your book. – Zelazny7 Jul 24 '12 at 01:46
  • 4
    Right, you're constrained by RAM. SAS indeed has much better support for "out-of-core" big data processing. – Wes McKinney Jul 24 '12 at 04:12
  • 5
    @WesMcKinney These workarounds shouldn't be needed any longer, because of the new csv loader you landed in 0.10, right? – Gabriel Grant Jul 29 '13 at 11:44
44

This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; I still had memory problems (my CSV was ~7.5 GB).

Right now, I just read chunks of the CSV file in a for loop and append them step by step to, e.g., an SQLite database:

import pandas as pd
import sqlite3
from pandas.io import sql
import subprocess

# In and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'

table_name = 'my_table' # name for the SQLite database table
chunksize = 100000 # number of lines to process at each iteration

# columns that should be read from the CSV file
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']

# Get number of lines in the CSV file
nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)
nlines = int(nlines.split()[0]) 

# connect to database
cnx = sqlite3.connect(out_sqlite)

# Iteratively read CSV and dump lines into the SQLite table
for i in range(0, nlines, chunksize):

    df = pd.read_csv(in_csv,  
            header=None,  # no header, define column header manually later
            nrows=chunksize, # number of rows to read at each iteration
            skiprows=i)   # skip rows that were already read

    # assign the column names manually (the file has no header row)
    df.columns = columns

    # append the chunk to the SQLite table
    sql.to_sql(df,
               name=table_name,
               con=cnx,
               index=False,        # don't write the DataFrame index
               if_exists='append')
cnx.close()    
  • 4
    Super useful to see a realistic use-case for the chunked reading feature. Thanks. – Alex Kestner Jun 30 '15 at 16:37
  • 6
    Just a small remark, to this old topic: `pandas.read_csv` directly returns (at least on the version I'm currently using) an iterator if you simply provide `iterator=True` and `chunksize=chunksize`. Hence, you would just do a `for` loop over the `pd.read_csv` call, instead of re-instantiating it every time. However, the repeated calls only cost the call overhead, so there may be no significant impact (see the sketch after these comments). – Joël Dec 08 '15 at 15:19
  • 1
    Hi, Joel. Thanks for the note! The `iterator=True` and `chunksize` parameters already existed back then if I remember correctly. Maybe there was a bug in an older version which caused the memory blow-up -- I will give it another try next time I read a large DataFrame in Pandas (I am mostly using Blaze now for such tasks) –  Dec 08 '15 at 18:57
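For reference, a minimal sketch of the variant suggested in the comment above, reusing the file paths, column list, and table name from this answer; `read_csv` with `chunksize` opens the file once and yields the chunks itself:

import sqlite3
import pandas as pd

in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']

cnx = sqlite3.connect(out_sqlite)

# read_csv yields one DataFrame of `chunksize` rows per iteration
for chunk in pd.read_csv(in_csv, header=None, names=columns, chunksize=100000):
    chunk.to_sql(name='my_table', con=cnx, index=False, if_exists='append')

cnx.close()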
6

Below is my workflow.

import sqlalchemy as sa
import pandas as pd
import psycopg2

count = 0
con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')
# con = sa.create_engine('sqlite:///XXXXX.db')  # or use SQLite instead
chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1",
                     sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

Based on your file size, you may want to tune the chunksize.

for chunk in chunks:
    chunk.to_sql(name='Table', if_exists='append', con=con)
    count += 1
    print(count)

After you have all the data in the database, you can query out whatever you need.
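For example, a minimal sketch of pulling a filtered subset back out; `some_column` is a hypothetical column name, and the table and database names match the snippet above:

import pandas as pd
import sqlalchemy as sa

con = sa.create_engine('sqlite:///XXXXX.db')

# Only the rows matching the WHERE clause are loaded into memory
subset = pd.read_sql('SELECT * FROM "Table" WHERE some_column > 0', con=con)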

BENY
4

If you want to load huge CSV files, dask might be a good option. It mimics the pandas API, so it feels quite similar to pandas.

link to dask on github
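A minimal sketch, assuming a file called large_dataset.csv; dask.dataframe reads the CSV lazily in partitions instead of loading it all into RAM:

import dask.dataframe as dd

# Builds a lazy, partitioned dataframe without reading the whole file
ddf = dd.read_csv('large_dataset.csv')

# Operations build a task graph; .compute() evaluates it out of core
# and returns an ordinary pandas object
print(ddf.describe().compute())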

1

You can use PyTables rather than a pandas DataFrame. It is designed for large data sets, and the file format is HDF5, so processing time is relatively fast.
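A minimal sketch using pandas' HDFStore, which is built on PyTables; the file names, chunk size, and the column name `charge` are assumptions for illustration:

import pandas as pd

# Convert the CSV to an on-disk HDF5 table, chunk by chunk
with pd.HDFStore('my_store.h5', mode='w') as store:
    for chunk in pd.read_csv('large_dataset.csv', chunksize=100000):
        store.append('data', chunk, data_columns=True)

# Query the table later; only the matching rows are loaded into memory
with pd.HDFStore('my_store.h5', mode='r') as store:
    subset = store.select('data', where='charge > 0')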

Elm662