
i have"MemoryError" when im trying to read file with 45 millions files.

How to solve this problem?

NOTE: My code works for small files

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

from pandas.tools.plotting import scatter_matrix

import matplotlib.pyplot as plt
from pylab import rcParams
import seaborn as sb


address = 'file.txt'  # file with 45 million lines = 500 MB
test = pd.read_csv(address)
test.columns = ['Year','Data']

test.boxplot(column='Data', by = 'Year')

plt.show()

This is the error:

Traceback (most recent call last):
  File "plot2.py", line 13, in <module>
    test = pd.read_csv(address)
  File "C:\Users\EYM\Desktop\web_scraping\venv\lib\site-packages\pandas\io\parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
  File "C:\Users\EYM\Desktop\web_scraping\venv\lib\site-packages\pandas\io\parsers.py", line 446, in _read
data = parser.read(nrows)
  File "C:\Users\EYM\Desktop\web_scraping\venv\lib\site-packages\pandas\io\parsers.py", line 1036, in read
ret = self._engine.read(nrows)
  File "C:\Users\EYM\Desktop\web_scraping\venv\lib\site-packages\pandas\io\parsers.py", line 1848, in read
data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 919, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 2141, in pandas._libs.parsers._concatenate_chunks
MemoryError
Raul Escalona

3 Answers


Set the low_memory parameter (default is True) to False:

>>> test = pd.read_csv(address, sep=" ", low_memory=False)

Try this also:

>>> chunksize = 100000  # example chunk size (number of rows)
>>> for chunk in pd.read_csv(filename, chunksize=chunksize, low_memory=False):
...     process(chunk)
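For the boxplot in the question, a minimal sketch of that loop could look like this (assuming the file has no header and two space-separated columns, Year and Data; the chunk size of 100000 is just an example):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict

chunksize = 100000  # example value; tune it to your RAM
parts = defaultdict(list)  # Year -> list of arrays of Data values

for chunk in pd.read_csv('file.txt', sep=" ", header=None,
                         names=['Year', 'Data'], chunksize=chunksize):
    for year, grp in chunk.groupby('Year'):
        parts[year].append(grp['Data'].values)

years = sorted(parts)
plt.boxplot([np.concatenate(parts[y]) for y in years], labels=years)
plt.show()

Only the numeric Data values are kept in memory as compact numpy arrays, not the parsed text, which keeps the peak memory much lower than reading everything at once.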

read this blog (https://www.dataquest.io/blog/pandas-big-data/)

dimension

The link below lists all of read_csv's parameters. One of them is chunksize, so you could use that:

chunksize = 100000  # example value; adjust to what fits in memory
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

read_csv parameters
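The same page also documents parameters like usecols and dtype that can shrink each chunk further. A rough sketch, assuming the file is space-separated with Year and Data columns (the dtypes are guesses about your value ranges):

import pandas as pd

chunksize = 100000  # example value
reader = pd.read_csv(
    'file.txt',
    sep=" ",                                     # assumed delimiter
    header=None,
    names=['Year', 'Data'],                      # assumed column names
    usecols=['Year', 'Data'],                    # keep only the columns you need
    dtype={'Year': 'int16', 'Data': 'float32'},  # smaller than the 64-bit defaults
    chunksize=chunksize,
)

for chunk in reader:
    process(chunk)  # your per-chunk work, as above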

shawnfunke

You cannot fit that big a DataFrame in memory. There are several ways you can get around it:

First, you can parse it the old way, using the csv library, reading the file line by line and writing the values into a dictionary. Pandas uses optimized structures to store DataFrames in memory, and those are much heavier than a basic dictionary.
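A bare-bones sketch of that approach, assuming a space-delimited file with no header and two columns, Year and Data:

import csv
from collections import defaultdict

data_by_year = defaultdict(list)  # Year -> list of Data values

with open('file.txt', newline='') as f:
    reader = csv.reader(f, delimiter=' ')
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        year, value = row[0], row[1]
        data_by_year[year].append(float(value))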

Another way would be to use the nrows (or chunksize) parameter in read_csv to only read parts of the file, and then do your stuff on the dataframes one by one and save them in separate pkl files.
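For example, something along these lines (the chunk size and file names are only illustrative):

import pandas as pd

chunksize = 100000  # rows per chunk; tune to your RAM
for i, chunk in enumerate(pd.read_csv('file.txt', chunksize=chunksize)):
    # ... do your stuff on this chunk ...
    chunk.to_pickle('chunk_{}.pkl'.format(i))  # park each piece on disk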

If you only want statistical info about the data, you can compute that per chunk and then discard the DataFrames. You could also extract just the useful data to get smaller DataFrames and then merge them into one DataFrame your memory can hold.
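For instance, you could keep only a small per-year summary from each chunk and combine those at the end. A sketch, again assuming Year and Data columns in a space-separated file:

import pandas as pd

pieces = []
for chunk in pd.read_csv('file.txt', sep=" ", header=None,
                         names=['Year', 'Data'], chunksize=100000):
    # keep a tiny per-chunk summary instead of the raw rows
    pieces.append(chunk.groupby('Year')['Data'].agg(['count', 'sum']))

# the summaries are small, so combining them is cheap
stats = pd.concat(pieces).groupby(level=0).sum()
stats['mean'] = stats['sum'] / stats['count']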

If you absolutely want the entire DataFrame and most of the data is numeric, you can reduce memory by using this function:

def reduce_mem(df):
    # downcast each numeric column to the smallest dtype that can hold it
    df = df.apply(pd.to_numeric, errors='ignore', downcast='float')
    df = df.apply(pd.to_numeric, errors='ignore', downcast='integer')
    df = df.apply(pd.to_numeric, errors='ignore', downcast='unsigned')
    return df

You still have to read the dataframe by chunks (using chunksize or nrows) but you can then try to merge the chunks if the memory reduction is enough.
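For example, a sketch that downcasts each chunk with reduce_mem before concatenating (the chunk size is just an example):

import pandas as pd

chunks = []
for chunk in pd.read_csv('file.txt', chunksize=100000):
    chunks.append(reduce_mem(chunk))  # downcast each piece first

test = pd.concat(chunks, ignore_index=True)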

Also, df.memory_usage(deep=True) is useful: it returns the memory usage of each column in bytes, so you can check how big the DataFrame actually is.

Gerry L.