311

I am trying to read a large CSV file (approx. 6 GB) in pandas and I am getting a memory error:

MemoryError                               Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')

...

MemoryError: 

Any help on this?

Vaidøtas I.
Rajkumar Kumawat
  • 3
    Curiously, a very similar [question](https://stackoverflow.com/questions/23411619/reading-large-text-files-with-pandas) was asked almost a year before this one... – DarkCygnus Jun 26 '17 at 21:46
  • Possible duplicate of [Reading large text files with Pandas](https://stackoverflow.com/questions/23411619/reading-large-text-files-with-pandas) – unode Nov 22 '17 at 17:05
  • Does this answer your question? ["Large data" work flows using pandas](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) – AMC Mar 16 '20 at 21:09

16 Answers

444

The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at one time, one way to avoid the problem would be to process the CSV in chunks (by specifying the chunksize parameter):

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # chunk is a DataFrame. To "process" the rows in the chunk:
    for index, row in chunk.iterrows():
        print(row)

The chunksize parameter specifies the number of rows per chunk. (The last chunk may contain fewer than chunksize rows, of course.)


pandas >= 1.2

read_csv with chunksize returns a context manager, to be used like so:

chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)

See GH38225
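
If you need a reduced result from every chunk (as discussed in the comments below), aggregate each chunk and build the final result with a single concat after the loop rather than appending to a DataFrame inside it. A minimal sketch, reusing the filename variable from the snippets above; some_key and some_value are illustrative column names:

import pandas as pd

chunksize = 10 ** 6
partials = []
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        # reduce each chunk to just what you need, e.g. per-key partial sums
        partials.append(chunk.groupby("some_key")["some_value"].sum())

# one concat after the loop avoids quadratic copying;
# a final groupby combines partial sums coming from different chunks
result = pd.concat(partials).groupby(level=0).sum()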

Dan Dascalescu
unutbu
  • 26
    you generally need 2X the final memory to read in something (from csv, though other formats are better at having lower memory requirements). FYI this is true for trying to do almost anything all at once. Much better to chunk it (which has a constant memory usage). – Jeff Sep 21 '14 at 17:57
  • By `process(chunk)` you mean creating an empty DF before and then appending `chunk` to it in the for loop, like `DF.append(chunk)`? – altabq Feb 17 '16 at 16:02
  • 31
    @altabq: The problem here is that we don't have enough memory to build a single DataFrame holding all the data. The solution above tries to cope with this situation by reducing the chunks (e.g. by aggregating or extracting just the desired information) one chunk at a time -- thus saving memory. Whatever you do, DO NOT call `DF.append(chunk)` inside the loop. That will use `O(N^2)` copying operations. It is better to append the aggregated data *to a list*, and then build the DataFrame from the list with *one call* to `pd.DataFrame` or `pd.concat` (depending on the type of aggregated data). – unutbu Feb 17 '16 at 18:29
  • 15
    @altabq: Calling `DF.append(chunk)` in a loop requires `O(N^2)` copying operations where `N` is the size of the chunks, because each call to `DF.append` returns a new DataFrame. Calling `pd.DataFrame` or `pd.concat` **once** outside the loop reduces the amount of copying to `O(N)`. – unutbu Feb 17 '16 at 18:33
  • Should chunk and row be seen as one and the same, in these reads? – Pyderman May 11 '16 at 17:34
  • @unutbu The pandas documentation gives the example of a small csv, and with a chunk size of 4, 4 rows get read in per chunk. I'm wondering if this 1-to-1 correspondence holds true for a csv with millions of rows. (I also have 6GB csv that I'm currently trying to read in with a chunk size of 100000, it's taking its time). – Pyderman May 11 '16 at 18:04
  • 5
    @Pyderman: Yes, the `chunksize` parameter refers to the number of rows per chunk. The last chunk may contain fewer than `chunksize` rows, of course. – unutbu May 11 '16 at 18:06
  • 2
    @Pyderman: You might want to test how long it takes to consume the iterator: `for df in pd.read_csv(..., chunksize=10**5): pass` without doing any processing. If that succeeds quickly, then you know it is something else inside the loop which is taking a lot of time. (Never call `result_df = result_df.append(df)` inside the loop, for example, since that [requires quadratic copying](http://stackoverflow.com/a/36489724/190597).) – unutbu May 11 '16 at 18:18
  • 1
    @unutbu Thanks, yes taking your advice from your ealier comment, I have created a `list` rather than a DataFrame, and appending each chunk to this list in each iteration (with a view to creating a DataFrame from the resulting list when all is done). Nothing else is done within the loop. This is what is running right now. That's the approach you were suggesting, correct? – Pyderman May 11 '16 at 18:24
  • 11
    @Pyderman: Yes; calling `pd.concat([list_of_dfs])` *once* after the loop is much faster than calling `pd.concat` or `df.append` many times within the loop. Of course, you'll need a considerable amount of memory to hold the entire 6GB csv as one DataFrame. – unutbu May 11 '16 at 18:27
  • @unutbu : How would you handle this if you did not have enough memory to hold 6 GB Data? I have pytables / storing in a disk however if i had to append it all to a DF before this storage happens that will blow memory. Any suggestions? – CodeGeek123 Jan 09 '17 at 16:33
  • 1
    @CodeGeek123: If the chunks can be processed independently, you could use [HDFStore.append](http://stackoverflow.com/q/16997048/190597) to build the pytables result accretively. You might find some useful example code [here too](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#hdfstore). – unutbu Jan 09 '17 at 20:28
  • 3
    just following on from Jeff's comment: In my case, I need more than ~3x memory: I'm loading a 19gb csv on a 64 bit OS and have 64gb of ram, the vast majority of which is free. Using read_csv, I rapidly run out of memory and my M2 SSD disk swap file starts being used. After about ~10 - 15 minutes, my screen goes black and I have to reboot. I'm definitely not overheating. I believe the memory overheads are due to panda's column type detection, and my file has ~100+ columns. – Carl Sep 05 '17 at 08:17
  • 1
    Here is a follow-up question: there is an option to load only certain columns, that reduces both CPU and memory overheads associated with dtype detection, which I believe are quite substantial. Therefore, if you don't need all the columns in the csv, then this could also solve your problem. – Carl Sep 05 '17 at 13:08
  • how would this work if i had a bunch of parquet files? – femi May 21 '20 at 22:58
  • how with this work if one is loading parquet files instead of csv ? parquet files cant be split. thanks – femi Jun 07 '20 at 10:10
  • @jtbandes: I am trying to use the same: `chunksize = 10 ** 6; for chunk in pd.read_csv(filename, chunksize=chunksize): process(chunk)`. Where do we define "process"? Or is there any other way except append and concat? –  Jun 23 '20 at 11:26
  • “process” is not a real function. That’s where you would modify the code to do whatever you want with the chunk. The key suggestion here is to specify chunksize to read_csv so it gives you results in chunks rather than all at once. – jtbandes Jun 23 '20 at 15:38
  • @femi if you’re having trouble with a different format, you should post a new question – jtbandes Jun 23 '20 at 15:39
  • @jtbandes: what *exactly* is a chunk? Some code "processing" it in a simple way (e.g. print all rows) would greatly increase the value of this answer. – Dan Dascalescu Mar 14 '23 at 19:07
  • Not sure why I'm being tagged as I'm not the author of this answer, but I'd recommend you follow the documentation links in the answer, from which you can find more information about chunks. Feel free to suggest an [edit] if you can improve the answer, or post a new one! – jtbandes Mar 14 '23 at 19:21
137

Chunking shouldn't always be the first port of call for this problem.

  1. Is the file large due to repeated non-numeric data or unwanted columns?

    If so, you can sometimes see massive memory savings by reading in columns as categories and selecting required columns via the pd.read_csv usecols parameter (see the sketch after this list).

  2. Does your workflow require slicing, manipulating, exporting?

    If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.

  3. If all else fails, read line by line via chunks.

    Chunk via pandas or via the csv library as a last resort.
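
For point 1, here is a minimal sketch of the usecols/category approach; the column names (including the repetitive string column station) are illustrative:

import pandas as pd

df = pd.read_csv(
    'aphro.csv',
    sep=';',
    usecols=['station', 'date', 'rf'],   # load only the columns you need
    dtype={'station': 'category'},       # repetitive strings -> category
    parse_dates=['date'],
)
print(df.memory_usage(deep=True))        # check the per-column footprint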

Community
jpp
69

For large data, I recommend you use the library "dask", e.g.:

# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')

You can read more from the documentation here.

Another great alternative would be to use modin, because all the functionality is identical to pandas, yet it leverages distributed dataframe libraries such as dask.

From my projects, another excellent library is datatable.

# Datatable python library
import datatable as dt
df = dt.fread("s3://.../2018-*-*.csv")
  • 19
    Any benefits over pandas, could appreciate adding a few more pointers – PirateApp Apr 21 '18 at 11:38
  • 2
    I haven't used Dask for very long but the main advantages in my use cases were that Dask can run parallel on multiple machines, it can also fit data as slices into memory. – Simbarashe Timothy Motsi Apr 23 '18 at 07:42
  • 3
    thanks! is dask a replacement for pandas or does it work on top of pandas as a layer – PirateApp Apr 28 '18 at 12:30
  • 3
    Welcome, it works as a wrapper for Numpy, Pandas, and Scikit-Learn. – Simbarashe Timothy Motsi Apr 28 '18 at 12:35
  • 2
    I've tried to face several problems with Dask and always throws an error for everything. Even with chunks It throws Memory errors too. See https://stackoverflow.com/questions/59865572/dask-running-out-of-memory-even-with-chunks?noredirect=1#comment105870267_59865572 – Genarito Jan 24 '20 at 14:43
  • Thanks a lot, can you explain how I can save in a sensible way a large dask table to csv? I just used `.to_csv(...)` and it produced a tonne of `.part` files instead of a csv file? Do I have to turn the dask table back to pandas df before saving? tnx – NeStack Nov 17 '22 at 15:40
40

I proceeded like this:

chunks = pd.read_table('aphro.csv', chunksize=1000000, sep=';',
                       names=['lat', 'long', 'rf', 'date', 'slno'],
                       index_col='slno', header=None, parse_dates=['date'])

%time df = pd.concat(chunk.groupby(['lat', 'long', chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)
cs95
Rajkumar Kumawat
12

You can read in the data as chunks and save each chunk as pickle.

import pandas as pd 
import pickle

in_path = "" #Path where the large file is
out_path = "" #Path to save the pickle files to
chunk_size = 400000 #size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path,sep=separator,chunksize=chunk_size, 
                    low_memory=False)    


for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i+1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk,f,pickle.HIGHEST_PROTOCOL)

In the next step you read the pickles back in and combine them into your desired dataframe.

import glob
pickle_path = "" #Same Path as out_path i.e. where the pickle files are

data_p_files = []
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)

# DataFrame.append was removed in pandas 2.0; collecting the chunks and
# concatenating once is also much faster than appending inside a loop
df = pd.concat((pd.read_pickle(f) for f in data_p_files), ignore_index=True)
Chris Adams
Lukas Humpe
  • 4
    If your final `df` fits entirely in memory (as implied) and contains the same amount of data as your input, surely you don't need to chunk at all? – jpp Feb 19 '19 at 09:15
  • You would need to chunk in this case if, for example, your file is very wide (like greater than 100 columns with a lot of string columns). This increases the memory needed to hold the df in memory. Even a 4GB file like this could end up using between 20 and 30 GB of RAM on a box with 64 GB RAM. – cdabel Oct 14 '19 at 23:20
8

I want to make a more comprehensive answer based on most of the potential solutions that are already provided. I also want to point out one more potential aid that may help the reading process.

Option 1: dtypes

"dtypes" is a pretty powerful parameter that you can use to reduce the memory pressure of read methods. See this and this answer. Pandas, on default, try to infer dtypes of the data.

Every value stored requires a memory allocation. At a basic level, refer to the values below (the table illustrates the ranges of C types):

The maximum value of UNSIGNED CHAR = 255                                    
The minimum value of SHORT INT = -32768                                     
The maximum value of SHORT INT = 32767                                      
The minimum value of INT = -2147483648                                      
The maximum value of INT = 2147483647                                       
The minimum value of CHAR = -128                                            
The maximum value of CHAR = 127                                             
The minimum value of LONG = -9223372036854775808                            
The maximum value of LONG = 9223372036854775807

Refer to this page to see the matching between NumPy and C types.

Let's say you have an array of single-digit integers. You could assign it, say, a 16-bit integer type, but you would then allocate more memory than you actually need to store that array. To prevent this, you can set the dtype option on read_csv. You do not want to store the values as long integers when they actually fit in an 8-bit integer (np.int8 or np.uint8).

For a map between pandas and NumPy dtypes, see: https://pbpython.com/pandas_dtypes.html

You can pass the dtype parameter to pandas read methods as a dict, like {column: type}:

import numpy as np
import pandas as pd

df_dtype = {
        "column_1": int,
        "column_2": str,
        "column_3": np.int16,
        "column_4": np.uint8,
        ...
        "column_n": np.float32
}

df = pd.read_csv('path/to/file', dtype=df_dtype)
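
To verify the effect, you can compare the memory footprint with inferred vs. explicit dtypes (a minimal sketch; 'path/to/file' and the column names follow the example above):

import numpy as np
import pandas as pd

df_default = pd.read_csv('path/to/file')   # dtypes inferred by pandas
df_typed = pd.read_csv('path/to/file',
                       dtype={"column_3": np.int16, "column_4": np.uint8})

# deep=True also counts the memory held by object (string) columns
print(df_default.memory_usage(deep=True).sum())
print(df_typed.memory_usage(deep=True).sum())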

Option 2: Read by Chunks

Reading the data in chunks lets you hold only part of the data in memory at a time, so you can apply preprocessing to each chunk and keep the processed data rather than the raw data. It's much better if you combine this option with the first one, dtypes.

I want to point out the pandas cookbook sections for that process, which you can find here. Note the two sections there.
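
A minimal sketch that combines Option 1 (explicit dtypes) with chunked reading, using the pandas >= 1.2 context-manager form shown earlier; the path, columns and filter are illustrative:

import numpy as np
import pandas as pd

dtypes = {"column_3": np.int16, "column_4": np.uint8}
kept = []

with pd.read_csv('path/to/file', dtype=dtypes, chunksize=10 ** 6) as reader:
    for chunk in reader:
        # keep only the preprocessed/filtered part of each chunk, not the raw rows
        kept.append(chunk[chunk["column_4"] > 0])

df = pd.concat(kept, ignore_index=True)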

Option 3: Dask

Dask is a framework that is defined on Dask's website as:

Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love

It was born to cover the parts that pandas cannot reach. Dask is a powerful framework that gives you access to much more data by processing it in a distributed way.

You can use Dask to preprocess your data as a whole; Dask takes care of the chunking part, so unlike pandas you can just define your processing steps and let Dask do the work. Dask does not apply the computations until they are explicitly triggered by compute and/or persist (see the answer here for the difference), as sketched below.
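
A minimal sketch of that lazy behaviour; the file and column names follow the question's CSV as used in another answer here:

import dask.dataframe as dd

# nothing is read yet; this only builds a task graph
ddf = dd.read_csv('aphro.csv', sep=';')

# still lazy: this just describes an aggregation over all chunks
rf_by_lat = ddf.groupby('lat')['rf'].sum()

# the actual reading and computation happen only on compute()
print(rf_by_lat.compute())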

Other Aids (Ideas)

  • ETL flow designed for the data. Keeping only what is needed from the raw data.
    • First, apply ETL to whole data with frameworks like Dask or PySpark, and export the processed data.
    • Then see if the processed data can be fit in the memory as a whole.
  • Consider increasing your RAM.
  • Consider working with that data on a cloud platform.
null
7

Before using the chunksize option, if you want to be sure about the process function that you want to write inside the chunking for-loop (as mentioned by @unutbu), you can simply use the nrows option.

small_df = pd.read_csv(filename, nrows=100)

Once you are sure that the process block is ready, you can put it in the chunking for-loop for the entire dataframe, as sketched below.
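
A minimal sketch of that workflow, where process is a placeholder for your own logic and 'rf' is an illustrative column:

import pandas as pd

def process(df):
    # placeholder: whatever per-chunk cleaning/aggregation you intend to run
    return df[df['rf'] > 0]

# 1) prototype on a small sample first
small_df = pd.read_csv(filename, nrows=100)
process(small_df)

# 2) once the process block works, apply it chunk by chunk
results = [process(chunk) for chunk in pd.read_csv(filename, chunksize=100000)]
df = pd.concat(results, ignore_index=True)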

sam
6

The functions read_csv and read_table are almost the same, but you must assign the delimiter (e.g. ",") when you use read_table in your program, because read_table defaults to a tab separator.

def get_from_action_data(fname, chunk_size=100000):
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")

    df_ac = pd.concat(chunks, ignore_index=True)
    return df_ac
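
For illustration, read_table only differs in its default separator, so an explicit sep makes it equivalent to read_csv for the question's semicolon-separated file (a minimal sketch):

import pandas as pd

# read_table defaults to sep='\t'; pass sep explicitly for other delimiters
df1 = pd.read_table('aphro.csv', sep=';')

# equivalent read_csv call (read_csv defaults to sep=',')
df2 = pd.read_csv('aphro.csv', sep=';')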
Ron
Tyrion W
  • It would help if stated what your question is in this post. Like "What is the difference between read_csv and read_table?" or "Why does read table need a delimiter?" – nate_weldon Apr 26 '17 at 15:44
  • 1
    It depends how your file looks. Some files have common delimiters such as "," or "|" or "\t" but you may see other files with delimiters such as 0x01, 0x02 (making this one up) etc. So read_table is more suited to uncommon delimiters but read_csv can do the same job just as good. – Naufal Jun 10 '18 at 18:46
5

Solution 1:

Using pandas with large data

Solution 2:

TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList,sort=False)
petezurich
blacksheep
  • 3
    Here again we are loading the 6 GB file entirely into memory. Are there any options to process the current chunk and then read the next chunk? – Debashis Sahoo Dec 20 '18 at 09:08
  • 6
    just don't do `dfList.append`, just process each chunk ( `df` ) separately – gokul_uf Dec 20 '18 at 14:13
3

Here follows an example:

chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>, iterator=True, low_memory=False):

    #REPLACING BLANK SPACES AT COLUMNS' NAMES FOR SQL OPTIMIZATION
    chunk = chunk.rename(columns = {c: c.replace(' ', '') for c in chunk.columns})

    #YOU CAN EITHER: 
    #1)BUFFER THE CHUNKS IN ORDER TO LOAD YOUR WHOLE DATASET 
    chunkTemp.append(chunk)

    #2)DO YOUR PROCESSING OVER A CHUNK AND STORE THE RESULT OF IT
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]   
    #BUFFERING PROCESSED DATA
    queryTemp.append(query)

#!  NEVER DO pd.concat OR pd.DataFrame() INSIDE A LOOP
print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

#CONCATENATING PROCESSED DATA
query = pd.concat(queryTemp)
print(query)
jonathask
2

You can try sframe, which has the same syntax as pandas but allows you to manipulate files that are bigger than your RAM.

nunodsousa
2

If you use pandas to read a large file in chunks and then yield the chunks one by one, here is what I have done:

import pandas as pd

def chunck_generator(filename, header=False, chunk_size=10 ** 5):
    for chunk in pd.read_csv(filename, delimiter=',', iterator=True,
                             chunksize=chunk_size, parse_dates=[1]):
        yield chunk

def _generator(filename, header=False, chunk_size=10 ** 5):
    # thin wrapper that simply re-yields each chunk
    chunk = chunck_generator(filename, header=False, chunk_size=chunk_size)
    for row in chunk:
        yield row

if __name__ == "__main__":
    filename = r'file.csv'
    generator = _generator(filename=filename)
    for chunk in generator:
        print(chunk)
paulg
1

In case someone is still looking for something like this, I found that this new library called modin can help. It uses distributed computing that can help with the read. Here's a nice article comparing its functionality with pandas. It essentially uses the same functions as pandas.

import modin.pandas as pd
pd.read_csv(CSV_FILE_NAME)
Jaskaran
  • Can you comment on how this new module `modin` compares with the well-established [`dask.dataframe`](http://docs.dask.org/en/latest/dataframe.html)? For example, see [move from pandas to dask to utilize all local cpu cores](https://stackoverflow.com/questions/42649234/move-from-pandas-to-dask-to-utilize-all-local-cpu-cores). – jpp Apr 12 '19 at 18:17
1

If you have a CSV file with millions of rows and you want to load the full dataset, you can use dask_cudf (the GPU-backed Dask dataframe from the RAPIDS ecosystem):

import dask_cudf as dc

df = dc.read_csv("large_data.csv")
Sudhanshu
0

In addition to the answers above, for those who want to process a CSV and then export to csv, parquet or SQL, d6tstack is another good option. You can load multiple files, and it deals with data schema changes (added/removed columns). Chunked out-of-core support is already built in.

import glob
import d6tstack.combine_csv

def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply, sep=',', chunksize=1e6)

# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # slow but flexible
citynorman
0
def read_csv_with_progress(file_path, sep):
    import os
    import pandas as pd
    from tqdm import tqdm

    chunk_size = 50000  # Number of lines to read in each iteration

    # Get the total number of lines in the CSV file
    print("Calculating average line length + getting file size")
    counter = 0
    total_length = 0
    num_to_sample = 10
    for line in open(file_path, 'r'):
        counter += 1
        if counter > 1:
            total_length += len(line)
        if counter == num_to_sample + 1:
            break
    file_size = os.path.getsize(file_path)
    avg_line_length = total_length / num_to_sample
    avg_number_of_lines = int(file_size / avg_line_length)

    chunks = []
    with tqdm(total=avg_number_of_lines, desc='Reading CSV') as pbar:
        for chunk in pd.read_csv(file_path, chunksize=chunk_size, low_memory=False, sep=sep):
            chunks.append(chunk)
            pbar.update(chunk.shape[0])

    print("Concating...")
    df = pd.concat(chunks, ignore_index=True)
    return df
Nathan B