
I'm trying to open a CSV file with pandas, but I'm getting a MemoryError. The file is around 300 MB. Everything works fine when I use a smaller file.

I am using Windows 10 with 64 GB of RAM. I already tried changing the custom VM options in PyCharm ("Help" >> "Edit Custom VM Options") and setting higher memory values, but it still doesn't work.

import pandas as pd

df = pd.read_csv('report_OOP_Full.csv')

# I tried to add the following line but it doesn't help
# df.info(memory_usage='deep')

MemoryError: Unable to allocate 344. MiB for an array with shape (14, 3216774) and data type float64

Process finished with exit code 1

KenHBS
DarkWarrior
  • Do you need to import all rows? – Christopher Kinyua Apr 18 '20 at 05:56
  • Yes, if possible! – DarkWarrior Apr 18 '20 at 06:21
  • Do you know what the data types are in your columns? You could try to specify them in `pd.read_csv(fname, dtype=<>)`, that sometimes helps reduce memory usage. The question is different, but the solution may help you anyway: https://stackoverflow.com/questions/49684951/pandas-read-csv-dtype-read-all-columns-but-few-as-string – KenHBS Apr 18 '20 at 08:34
  • Thanks Ken! Yes, I have floats and strings. I will check that link and see if it could help. Thank you! – DarkWarrior Apr 18 '20 at 09:15
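
A minimal sketch of the dtype suggestion from the comments (not from the original thread); the column names below are hypothetical and would need to be replaced with the real header of report_OOP_Full.csv:

import pandas as pd

# Hypothetical column names -- replace them with the actual columns of the file.
# Reading floats as float32 roughly halves their memory footprint, and string
# columns with few distinct values can be read as 'category'.
dtypes = {
    'label': 'category',   # assumed string column
    'value_a': 'float32',  # assumed float column
    'value_b': 'float32',
}

df = pd.read_csv('report_OOP_Full.csv', dtype=dtypes)
print(df.memory_usage(deep=True).sum() / 1024**2, 'MB')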

2 Answers


This may not be the most efficient way, but have a go. Reduce or increase the chunk size depending on how much RAM you have available.

# Read the file in 10,000-row chunks and concatenate them once at the end
chunks = pd.read_csv('report_OOP_Full.csv', chunksize=10000)
chunk_list = []
for chunk in chunks:
    chunk_list.append(chunk)
df = pd.concat(chunk_list, sort=True)

If this doesn't work, try this:

# Load the first 10 chunks (100,000 rows) into df1
chunks = pd.read_csv('report_OOP_Full.csv', chunksize=10000)
chunk_list = []
for i, chunk in enumerate(chunks):
    if i >= 10:
        break
    chunk_list.append(chunk)
df1 = pd.concat(chunk_list, sort=True)


# Start over, skip the rows already read, and load the next 10 chunks into df2
# (note: skiprows also skips the header line, so the column labels of df2 may differ from df1)
chunks = pd.read_csv('report_OOP_Full.csv', skiprows=100000, chunksize=10000)
chunk_list = []
for i, chunk in enumerate(chunks):
    if i >= 10:
        break
    chunk_list.append(chunk)
df2 = pd.concat(chunk_list, sort=True)


df3 = pd.concat([df1, df2], sort=True)

skiprows was calculated from how many rows the previous dataframe had already read in (10 chunks x 10,000 rows = 100,000).
The first loop breaks after 10 chunks are loaded and stores them as df1; the file is then read again starting at chunk 11, and the two pieces are concatenated.

I understand that you're working with some big data, so I encourage you to take a look at this function I found. The link below explains how it works; credit for this function is here: credit

import numpy as np

def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    # Downcast each numeric column to the smallest dtype that can hold its values
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.uint8).min and c_max < np.iinfo(np.uint8).max:
                    df[col] = df[col].astype(np.uint8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.uint16).min and c_max < np.iinfo(np.uint16).max:
                    df[col] = df[col].astype(np.uint16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.uint32).min and c_max < np.iinfo(np.uint32).max:
                    df[col] = df[col].astype(np.uint32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
                elif c_min > np.iinfo(np.uint64).min and c_max < np.iinfo(np.uint64).max:
                    df[col] = df[col].astype(np.uint64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

This will make sure your dataframe uses as little memory as possible while you're working with it.
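
As a rough sketch (not part of the original answer), the function above could be combined with the chunked read so that the full float64 frame never has to exist in memory at once:

import pandas as pd

# Downcast each chunk right after it is read; the final concat picks a
# common dtype per column if different chunks were downcast differently.
chunk_list = []
for chunk in pd.read_csv('report_OOP_Full.csv', chunksize=10000):
    chunk_list.append(reduce_mem_usage(chunk))
df = pd.concat(chunk_list, sort=True)
print('Final size: {:.2f} MB'.format(df.memory_usage().sum() / 1024**2))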

HarriS

I guess another way would be to load only the rows which have the same value in the first column (in this case a string of one letter). I don't know if that is possible. For example:

A 4 5 6 3
A 3 4 5 7
A 2 1 4 9
A 1 1 8 7
B 1 2 3 1
B 2 2 3 3
C 1 2 1 2

First open a dataframe with only the rows starting with "A", then do the same with "B", "C" and so on. I don't know if that's possible, but it could help.
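
A rough sketch of that idea, assuming the chunked reader from the other answer; load_group is a hypothetical helper name, and the first column is selected by position because its real name is unknown:

import pandas as pd

def load_group(path, letter, chunksize=10000):
    # Read the file chunk by chunk and keep only the rows whose first
    # column equals `letter`, so only that group is held in memory.
    pieces = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        first_col = chunk.columns[0]
        pieces.append(chunk[chunk[first_col] == letter])
    return pd.concat(pieces, ignore_index=True)

# Load only the "A" rows first, then repeat for "B", "C", and so on.
df_a = load_group('report_OOP_Full.csv', 'A')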

DarkWarrior