100

I am exploring switching to Python and pandas as a long-time SAS user.

However, when running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB CSV file. It had about 200,000 rows and 200 columns of mostly numeric data.

With SAS, I can import a CSV file into a SAS dataset, and it can be as large as my hard drive.

Is there something analogous in pandas?

I regularly work with large files and do not have access to a distributed computing network.

Zelazny7
  • I'm not familiar with pandas, but you might want to look into iterating through the file. http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk – monkut Jul 24 '12 at 01:02

6 Answers

89

Wes is of course right! I'm just chiming in to provide slightly more complete example code. I had the same issue with a 129 MB file, which was solved by:

import pandas as pd

# read_csv with iterator=True returns a TextFileReader, which yields
# DataFrame chunks of 1000 rows at a time.
csv_iterator = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)

# Iterate through the chunks and print one row/record at a time
for chunk in csv_iterator:
    for index, row in chunk.iterrows():
        print(row)

# The loop above exhausts the iterator, so create a fresh one before
# concatenating all chunks into a single DataFrame.
csv_iterator = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)
df = pd.concat(csv_iterator, ignore_index=True)
# On pandas < 0.14.0, use pd.concat(list(csv_iterator), ignore_index=True)
fickludd
  • 6
    I think you can just do `df = pd.concat(tp, ignore_index=True)`? – Andy Hayden Jun 24 '13 at 12:24
  • @smci Tried this quickly with the same data repeated x4 (550 Mb) or x8 (1.1Gb). Interestingly, with or without [x for x in tp], the x4 went through fine, and x8 crashed in a MemoryError. – fickludd Oct 15 '13 at 12:26
  • @fickludd Are you using ipython / is the dataframe persisting in memory rather than being garbage collected? – Andy Hayden Oct 15 '13 at 21:10
  • @AndyHayden Python 2.7.3 (default, Aug 1 2012, 05:14:39) [GCC 4.6.3] on linux2. How could the data be garbage collected - I still have it stored in a dataframe? – fickludd Oct 16 '13 at 06:06
  • @fickludd Sorry not sure I understand the issue then :) – Andy Hayden Oct 16 '13 at 15:53
  • 3
    I get this error while using it: `AssertionError: first argument must be a list-like of pandas objects, you passed an object of type "TextFileReader"`. Any idea what is happening here? – Prince Kumar Feb 28 '14 at 23:02
  • 3
    This bug will be fixed in 0.14 (release soon), https://github.com/pydata/pandas/pull/6941; workaround for < 0.14.0 is to do ``pd.concat(list(tp), ignore_index=True)`` – Jeff Apr 23 '14 at 16:02
  • 1
    What if the values are strings or categorical? I am getting the error: incompatible categories in categorical concat – As3adTintin Jun 22 '15 at 17:26
  • This was really really helpful. Thanks. Was stuck on it for quite some time. – Run2 Jan 12 '16 at 05:27
  • why does `ignore_index` need to be true? – SARose Mar 31 '17 at 04:35
  • Hi, may I also request a solution for a similar problem that I am facing: https://stackoverflow.com/questions/46490474/memory-error-performing-sentiment-analysis-large-size-data – Sitz Blogz Oct 01 '17 at 18:02
84

In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).

At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or memory-mapped file (np.memmap)), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000), then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
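For illustration, here is a minimal sketch of the tedious pre-allocation workaround, assuming the row and column counts from the question are known in advance and every column is numeric (the file name is just a placeholder):

import csv
import numpy as np
import pandas as pd

# Assumed dimensions (taken from the question); in practice you would
# determine these with a quick first pass over the file.
n_rows, n_cols = 200000, 200

data = np.empty((n_rows, n_cols), dtype=np.float64)

with open('large_dataset.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for i, row in enumerate(reader):
        data[i] = np.asarray(row, dtype=np.float64)

df = pd.DataFrame(data)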

Wes McKinney
  • 1
    Say I can read the file and concat all of them together into one DataFrame. Does the DataFrame have to reside in memory? With SAS, I can work with datasets of any size as long as I have the hard-drive space. Is it the same with DataFrames? I get the impression they are constrained by RAM and not hard-drive space. Sorry for the noob question and thanks for your help. I'm enjoying your book. – Zelazny7 Jul 24 '12 at 01:46
  • 4
    Right, you're constrained by RAM. SAS indeed has much better support for "out-of-core" big data processing. – Wes McKinney Jul 24 '12 at 04:12
  • 5
    @WesMcKinney These workarounds shouldn't be needed any longer, because of the new csv loader you landed in 0.10, right? – Gabriel Grant Jul 29 '13 at 11:44
44

This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; I still had memory problems (my CSV was ~7.5 GB).

Right now, I just read chunks of the CSV file in a for loop and append them step by step to, e.g., an SQLite database:

import pandas as pd
import sqlite3
from pandas.io import sql
import subprocess

# In and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'

table_name = 'my_table' # name for the SQLite database table
chunksize = 100000 # number of lines to process at each iteration

# columns that should be read from the CSV file
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']

# Get number of lines in the CSV file
nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)
nlines = int(nlines.split()[0]) 

# connect to database
cnx = sqlite3.connect(out_sqlite)

# Iteratively read CSV and dump lines into the SQLite table
for i in range(0, nlines, chunksize):

    df = pd.read_csv(in_csv,  
            header=None,  # no header, define column header manually later
            nrows=chunksize, # number of rows to read at each iteration
            skiprows=i)   # skip rows that were already read

    # assign the column names manually (the file has no header row)
    df.columns = columns

    # append the chunk to the SQLite table
    sql.to_sql(df,
               name=table_name,
               con=cnx,
               index=False,        # don't write the DataFrame index
               if_exists='append')
cnx.close()    
  • 4
    Super useful to see a realistic use-case for the chunked reading feature. Thanks. – Alex Kestner Jun 30 '15 at 16:37
  • 6
    Just a small remark, to this old topic: `pandas.read_csv` directly returns (at least on the version I'm currently using) an iterator if you simply provide `iterator=True` and `chunksize=chunksize`. Hence, you would just do a `for` loop over the `pd.read_csv` call, instead of re-instantiating it every time. However, the repeated calls only cost the call overhead, so there may be no significant impact (see the sketch after these comments). – Joël Dec 08 '15 at 15:19
  • 1
    Hi, Joel. Thanks for the note! The `iterator=True` and `chunksize` parameters already existed back then if I remember correctly. Maybe there was a bug in an older version which caused the memory blow-up -- I will give it another try next time I read a large DataFrame in Pandas (I am mostly using Blaze now for such tasks) –  Dec 08 '15 at 18:57
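For reference, a minimal sketch of the variant suggested in the comment above, reusing the file paths, column list, and table name from this answer; `read_csv` with `chunksize` opens the file once and yields the chunks itself:

import sqlite3
import pandas as pd

in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']

cnx = sqlite3.connect(out_sqlite)

# read_csv yields one DataFrame of `chunksize` rows per iteration
for chunk in pd.read_csv(in_csv, header=None, names=columns, chunksize=100000):
    chunk.to_sql(name='my_table', con=cnx, index=False, if_exists='append')

cnx.close()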
6

Below is my workflow.

import sqlalchemy as sa
import pandas as pd
import psycopg2

count = 0
con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')
# con = sa.create_engine('sqlite:///XXXXX.db')  # or use SQLite instead
chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1",
                     sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

Based on your file size, you may want to tune the chunksize.

for chunk in chunks:
    chunk.to_sql(name='Table', if_exists='append', con=con)
    count += 1
    print(count)

After you have all the data in the database, you can query out whatever you need.
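For example, a minimal sketch of pulling a filtered subset back out; `some_column` is a hypothetical column name, and the table and database names match the snippet above:

import pandas as pd
import sqlalchemy as sa

con = sa.create_engine('sqlite:///XXXXX.db')

# Only the rows matching the WHERE clause are loaded into memory
subset = pd.read_sql('SELECT * FROM "Table" WHERE some_column > 0', con=con)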

BENY
4

If you want to load huge CSV files, dask might be a good option. It mimics the pandas API, so it feels quite similar to pandas.

link to dask on github
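A minimal sketch, assuming a file called large_dataset.csv; dask.dataframe reads the CSV lazily in partitions instead of loading it all into RAM:

import dask.dataframe as dd

# Builds a lazy, partitioned dataframe without reading the whole file
ddf = dd.read_csv('large_dataset.csv')

# Operations build a task graph; .compute() evaluates it out of core
# and returns an ordinary pandas object
print(ddf.describe().compute())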

1

You can use PyTables rather than a pandas DataFrame. It is designed for large data sets, and the file format is HDF5, so processing time is relatively fast.
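A minimal sketch using pandas' HDFStore, which is built on PyTables; the file names, chunk size, and the column name `charge` are assumptions for illustration:

import pandas as pd

# Convert the CSV to an on-disk HDF5 table, chunk by chunk
with pd.HDFStore('my_store.h5', mode='w') as store:
    for chunk in pd.read_csv('large_dataset.csv', chunksize=100000):
        store.append('data', chunk, data_columns=True)

# Query the table later; only the matching rows are loaded into memory
with pd.HDFStore('my_store.h5', mode='r') as store:
    subset = store.select('data', where='charge > 0')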

Elm662