
i have"MemoryError" when im trying to read file with 45 millions files.

How to solve this problem?

NOTE: My code works for small files

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

from pandas.tools.plotting import scatter_matrix

import matplotlib.pyplot as plt
from pylab import rcParams
import seaborn as sb


address = 'file.txt'  # file with 45 million lines = 500 MB
test = pd.read_csv(address)
test.columns = ['Year','Data']

test.boxplot(column='Data', by = 'Year')

plt.show()

This is the error:

Traceback (most recent call last):
  File "plot2.py", line 13, in <module>
    test = pd.read_csv(address)
  File "C:\Users\EYM\Desktop\web_scraping\venv\lib\site-packages\pandas\io\parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
  File "C:\Users\EYM\Desktop\web_scraping\venv\lib\site-packages\pandas\io\parsers.py", line 446, in _read
data = parser.read(nrows)
  File "C:\Users\EYM\Desktop\web_scraping\venv\lib\site-packages\pandas\io\parsers.py", line 1036, in read
ret = self._engine.read(nrows)
  File "C:\Users\EYM\Desktop\web_scraping\venv\lib\site-packages\pandas\io\parsers.py", line 1848, in read
data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 919, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 2141, in pandas._libs.parsers._concatenate_chunks
MemoryError
Raul Escalona

3 Answers


Set the low_memory parameter (default is True) to False:

>>> test = pd.read_csv(address, sep=" ", low_memory=False)

Try this also:

>>> chunksize = 100000  # example chunk size (number of rows)
>>> for chunk in pd.read_csv(filename, chunksize=chunksize, low_memory=False):
...     process(chunk)
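For the boxplot in the question, a minimal sketch of that loop could look like this (assuming the file has no header and two space-separated columns, Year and Data; the chunk size of 100000 is just an example):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict

chunksize = 100000  # example value; tune it to your RAM
parts = defaultdict(list)  # Year -> list of arrays of Data values

for chunk in pd.read_csv('file.txt', sep=" ", header=None,
                         names=['Year', 'Data'], chunksize=chunksize):
    for year, grp in chunk.groupby('Year'):
        parts[year].append(grp['Data'].values)

years = sorted(parts)
plt.boxplot([np.concatenate(parts[y]) for y in years], labels=years)
plt.show()

Only the numeric Data values are kept in memory as compact numpy arrays, not the parsed text, which keeps the peak memory much lower than reading everything at once.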

read this blog (https://www.dataquest.io/blog/pandas-big-data/)

dimension

The link below lists all of read_csv's parameters. One of them is chunksize, so you could use that:

chunksize = 100000  # example value; adjust to what fits in memory
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

read_csv parameters
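The same page also documents parameters like usecols and dtype that can shrink each chunk further. A rough sketch, assuming the file is space-separated with Year and Data columns (the dtypes are guesses about your value ranges):

import pandas as pd

chunksize = 100000  # example value
reader = pd.read_csv(
    'file.txt',
    sep=" ",                                     # assumed delimiter
    header=None,
    names=['Year', 'Data'],                      # assumed column names
    usecols=['Year', 'Data'],                    # keep only the columns you need
    dtype={'Year': 'int16', 'Data': 'float32'},  # smaller than the 64-bit defaults
    chunksize=chunksize,
)

for chunk in reader:
    process(chunk)  # your per-chunk work, as above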

shawnfunke

You cannot fit that big a DataFrame in memory. There are several ways you can get around it:

First, you can parse it the old way, using the csv library, reading the file line by line and writing the values into a dictionary. Pandas uses optimized structures to store DataFrames in memory, and those are much heavier than a basic dictionary.
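A bare-bones sketch of that approach, assuming a space-delimited file with no header and two columns, Year and Data:

import csv
from collections import defaultdict

data_by_year = defaultdict(list)  # Year -> list of Data values

with open('file.txt', newline='') as f:
    reader = csv.reader(f, delimiter=' ')
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        year, value = row[0], row[1]
        data_by_year[year].append(float(value))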

Another way would be to use the nrows (or chunksize) parameter in read_csv to only read parts of the file, and then do your stuff on the dataframes one by one and save them in separate pkl files.
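For example, something along these lines (the chunk size and file names are only illustrative):

import pandas as pd

chunksize = 100000  # rows per chunk; tune to your RAM
for i, chunk in enumerate(pd.read_csv('file.txt', chunksize=chunksize)):
    # ... do your stuff on this chunk ...
    chunk.to_pickle('chunk_{}.pkl'.format(i))  # park each piece on disk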

If you only want statistical info about the data, you can compute that per chunk and then discard the DataFrames. You could also extract just the useful data to get smaller DataFrames and then merge them into one DataFrame your memory can hold.
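For instance, you could keep only a small per-year summary from each chunk and combine those at the end. A sketch, again assuming Year and Data columns in a space-separated file:

import pandas as pd

pieces = []
for chunk in pd.read_csv('file.txt', sep=" ", header=None,
                         names=['Year', 'Data'], chunksize=100000):
    # keep a tiny per-chunk summary instead of the raw rows
    pieces.append(chunk.groupby('Year')['Data'].agg(['count', 'sum']))

# the summaries are small, so combining them is cheap
stats = pd.concat(pieces).groupby(level=0).sum()
stats['mean'] = stats['sum'] / stats['count']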

If you absolutely want the entire DataFrame and most of the data is numeric, you can reduce memory by using this function:

def reduce_mem(df):
    # downcast each numeric column to the smallest dtype that can hold it
    df = df.apply(pd.to_numeric, errors='ignore', downcast='float')
    df = df.apply(pd.to_numeric, errors='ignore', downcast='integer')
    df = df.apply(pd.to_numeric, errors='ignore', downcast='unsigned')
    return df

You still have to read the dataframe by chunks (using chunksize or nrows) but you can then try to merge the chunks if the memory reduction is enough.
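For example, a sketch that downcasts each chunk with reduce_mem before concatenating (the chunk size is just an example):

import pandas as pd

chunks = []
for chunk in pd.read_csv('file.txt', chunksize=100000):
    chunks.append(reduce_mem(chunk))  # downcast each piece first

test = pd.concat(chunks, ignore_index=True)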

Also, df.memory_usage(deep=True) is useful: it returns the memory usage of each column in bytes, so you can check how big the DataFrame actually is.

Gerry L.