
I've been trying to process a 1.4 GB CSV file with pandas, but I keep having memory problems. I have tried different things in an attempt to make pandas read_csv work, to no avail.

  • It didn't work when I used the iterator=True and chunksize=number parameters. Moreover, the smaller the chunksize, the slower it was to process the same amount of data (a minimal sketch of this chunked approach follows the list).
    • (Simple per-chunk overhead doesn't explain it, because it was far slower when the number of chunks was large. I suspect that when processing each chunk, pandas has to read through all the chunks before it to "get to it", instead of jumping straight to the start of that chunk. That seems to be the only way to explain it.)
  • Then, as a last resort, I split the CSV file into 6 parts and tried to read them one by one, but I still get a MemoryError.
    • (I monitored Python's memory usage while running the code below, and found that each time Python finishes processing a file and moves on to the next one, the memory usage goes up. It seems clear that pandas doesn't release the memory for the previous file after it has finished processing it; a sketch of explicitly releasing it follows the code below.)
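
A minimal sketch of that chunked attempt (the file name, chunk size and per-chunk processing are illustrative, not the exact code I ran):

import pandas as pd

# 'Crimes.csv' and the chunk size are illustrative values
reader = pd.read_csv('Crimes.csv', iterator=True, chunksize=100000)
for chunk_df in reader:
    part = chunk_df.iloc[:, [5, 8, 15, 16]].dropna(how='any')  # same four columns as the code below
    # ... the part that writes into the SQL database went here ...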

The code below may not make much sense on its own; that's because I removed the part where it writes into an SQL database, to simplify it and isolate the problem.

import pandas as pd
import glob
filenameStem = 'Crimes'
counter = 0
for filename in glob.glob(filenameStem + '_part*.csv'): # reading files Crimes_part1.csv through Crimes_part6.csv
    chunk = pd.read_csv(filename)
    df = chunk.iloc[:,[5,8,15,16]]
    df = df.dropna(how='any')
    counter += 1
    print(counter)
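
In case it is relevant, here is a sketch of explicitly releasing the per-file objects between iterations (the `del`/`gc.collect()` calls are an illustration, not something the code above does):

import gc
import glob
import pandas as pd

for filename in glob.glob('Crimes_part*.csv'):
    chunk = pd.read_csv(filename)
    df = chunk.iloc[:, [5, 8, 15, 16]].dropna(how='any')
    # ... write df to the SQL database here ...
    del chunk, df   # drop the references to the per-file frames
    gc.collect()    # force a collection pass before reading the next file
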
Meng zhao
  • It looks like you only want to read certain columns. If that's the case, look at the `usecols` argument to `read_csv`. – BrenBarn Mar 18 '16 at 08:09
  • You might want to try this answer: http://stackoverflow.com/a/32248310/5276797 – IanS Mar 18 '16 at 08:52
  • First, use `usecols` with column names to read in only your desired columns, so you don't waste gigabytes on unwanted columns you're only going to drop anyway, then kill the `chunk.iloc` line. Second, use the `dtype` argument to specify each column's type, instead of string, which wastes memory. Third, I doubt your claim that `chunksize=number` was no use and made things slower; tell us the actual specific `chunksize` values you tried, and what timings you got (after you use `usecols` and `dtype`). – smci Mar 10 '19 at 03:56

3 Answers


You may try to parse only those columns that you need (as @BrenBarn said in the comments):

import glob
import pandas as pd

def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)

fmask = 'Crimes_part*.csv'
cols = [5,8,15,16]

df = get_merged_csv(glob.glob(fmask), index_col=None, usecols=cols).dropna(how='any')

print(df.head())

PS: this will include only 4 out of at least 17 columns in your resulting DataFrame.
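
If memory is still tight after that, you can also pass a `dtype` mapping to `read_csv` (forwarded through `**kwargs` here) so the selected columns are not all parsed as 64-bit numbers or Python objects. The column names and types below are placeholders; adjust them to your file:

# placeholder names/types -- the keys must name columns that are actually read
dtypes = {'Latitude': 'float32', 'Longitude': 'float32'}

df = get_merged_csv(glob.glob(fmask), index_col=None, usecols=cols, dtype=dtypes).dropna(how='any')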

MaxU - stand with Ukraine

Thanks for the reply.

After some debugging, I have located the problem: the `iloc` subsetting in pandas created a circular reference, which prevented garbage collection. A detailed discussion can be found here.
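
The workaround, along the lines of the comments above, is to skip the full-width intermediate frame entirely and let `read_csv` select the columns (a minimal sketch):

import glob
import pandas as pd

for filename in glob.glob('Crimes_part*.csv'):
    # select the needed columns at read time instead of slicing with iloc afterwards
    df = pd.read_csv(filename, usecols=[5, 8, 15, 16]).dropna(how='any')
    # ... write df to the SQL database here ...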

Meng zhao
  • Please accept this answer of yours, and also edit a clarification line into your question. Otherwise this question is misleading to people generally looking for how to reduce memory usage with `pd.read_csv`. You should always use the `usecols` (and `dtype`) parameters during `pd.read_csv` whenever possible, and never rely on `.iloc[..]` after the fact to slice selected columns and expect garbage collection to recover the wasted gigabytes. – smci Mar 10 '19 at 04:00

I have run into the same issue with a large CSV file. Read the CSV in chunks by fixing a chunksize: use the chunksize or iterator parameter to return the data in chunks. Syntax:

csv_onechunk = pandas.read_csv(filepath, sep=delimiter, skiprows=1, chunksize=10000)

Then concatenate the chunks (only valid with the C parser).
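
A minimal sketch of the two steps together (the file path, delimiter, skiprows and chunk size are placeholders, as in the snippet above):

import pandas as pd

filepath = 'Crimes.csv'   # placeholder path
delimiter = ','           # placeholder delimiter

reader = pd.read_csv(filepath, sep=delimiter, skiprows=1, chunksize=10000)
chunks = [chunk for chunk in reader]        # read the file chunk by chunk
df = pd.concat(chunks, ignore_index=True)   # concatenate the chunks into one DataFrame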

Bugs