
I have a dataframe with 320k rows and 450 columns, and several lists of column numbers:

list1 = [1,3,5,...]
list2 = [4,9,...]
...

My goal is to replace certain values in the columns from each list and then save the result:

df[df[list1] > 7] = np.nan
df[df[list2] > 90] = np.nan
...
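
For reference, here is a minimal, self-contained sketch of the kind of replacement I mean (the toy data, column lists and thresholds below are made up, not my real ones):

import numpy as np
import pandas as pd

# Toy stand-in for the real 320k x 450 frame
df = pd.DataFrame(np.random.randint(0, 100, size=(5, 6)))
list1 = [1, 3, 5]
list2 = [0, 4]

# Passing a list of columns builds a boolean DataFrame mask, so only
# the cells in those columns that break the threshold become NaN
df[df[list1] > 7] = np.nan
df[df[list2] > 90] = np.nan
print(df)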

The size of the dataframe forced me to do it in chunks:

for chunk in pd.read_csv(filePrev, chunksize=10000, header=None):
    chunk[chunk[list1] >= 7] = np.nan
    chunk[chunk[list2] >= 90] = np.nan
    ...
    chunk.to_csv(newFile, mode='a', header=False, index=False)

But it worked incorrectly: I've run this code many times, and it mostly couldn't finish (MemoryErrors in IPython, or the application was simply closed by Windows), regardless of the chunksize value. When it did finish, it had replaced all values with NaN in most of the rows, and only in some rows were the values replaced correctly.


I've tried the same logic on a small piece of the same dataset, and it works correctly:

In [11]: df = pd.read_csv(filePrev,nrows=5,usecols=[1,2,3,4,5,6,7],header=None)

In [12]: df
Out[12]:
   1  2  3  4  5  6  7
0  1  1  1  1  1  1  1
1  3  1  1  1  2  1  1
2  3  1  1  1  1  1  1
3  3  1  1  1  2  1  2
4  3  1  1  1  1  1  1

In [13]: list = [1,7]

In [14]: df[df[list] > 1] = np.nan

In [15]: df
Out[15]:
    1  2  3  4  5  6   7
0   1  1  1  1  1  1   1
1 NaN  1  1  1  2  1   1
2 NaN  1  1  1  1  1   1
3 NaN  1  1  1  2  1 NaN
4 NaN  1  1  1  1  1   1

So, any ideas? Can this be implemented in "chunking" mode, or is there another way (if so, I'd need an example)? I just want to replace certain values with NaN... :)

Chernyavski.aa
  • It might be because the way you're selecting data returns a copy. Try using select http://pandas.pydata.org/pandas-docs/stable/indexing.html – canyon289 May 05 '15 at 01:48
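
For illustration, here is a sketch of the kind of explicit, label-based selection described in the linked indexing docs, using .loc (whether that is exactly what the commenter meant, I'm not sure; the frame, columns and threshold below are made up):

import numpy as np
import pandas as pd

# Toy frame; list1 and the threshold mirror the question, not real data
df = pd.DataFrame(np.random.randint(0, 100, size=(5, 8)))
list1 = [1, 7]

# Assign NaN column by column via .loc, so there is no ambiguity about copies
for col in list1:
    df.loc[df[col] > 7, col] = np.nan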

2 Answers


It's possible this could be improved by keeping the file open, rather than opening the file each time in append mode:

with open(newFile, 'a') as f:
    for chunk in pd.read_csv(filePrev, chunksize=10000, header=None):
        chunk[chunk[list1] >= 7] = np.nan
        chunk[chunk[list2] >= 90] = np.nan
        chunk.to_csv(f, header=False, index=False)

Someone reported this behaviour here recently, and this change gave them a 98.3% performance gain on Windows (I only saw around 25% on OS X).


If you run your Python code with profile (or IPython's %prun), you can see what takes the most time and which functions are called most often. In the case of the question I referred to above, most of the time was spent in Python's close function (the output file is opened and closed on every append unless you keep it open).
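
As a concrete illustration, here is a minimal profiling sketch, assuming the loop above is wrapped in a function; filePrev, newFile, list1 and list2 are the names from the question, and the paths and lists below are placeholders:

import cProfile
import pstats
import numpy as np
import pandas as pd

filePrev = 'prev.csv'   # placeholder input path
newFile = 'new.csv'     # placeholder output path
list1, list2 = [1, 3, 5], [4, 9]

def process():
    # Same chunked loop as above, keeping the output file open
    with open(newFile, 'a') as f:
        for chunk in pd.read_csv(filePrev, chunksize=10000, header=None):
            chunk[chunk[list1] >= 7] = np.nan
            chunk[chunk[list2] >= 90] = np.nan
            chunk.to_csv(f, header=False, index=False)

cProfile.run('process()', 'process.prof')
pstats.Stats('process.prof').sort_stats('cumulative').print_stats(15)

(In IPython, %prun process() gives the same kind of output interactively.)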


Note: the logic looks OK; you're not assigning to a copy. As you can see in your smaller example, the code works!

Andy Hayden

The problem was in the code processing certain columns. It contained lines like this:

chunk[chunk[393] > 50] = np.nan

Instead of

chunk[chunk[[393]] > 50] = np.nan

If, for some row N,

chunk[393][N] > 50

then the entire row N was replaced with NaN: chunk[393] > 50 is a boolean Series, so the assignment hits whole rows, whereas chunk[[393]] > 50 is a one-column boolean DataFrame and only the matching cells in column 393 are changed.
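
A minimal sketch of the difference, using a toy frame with column 0 standing in for column 393:

import numpy as np
import pandas as pd

df = pd.DataFrame({0: [10, 60, 20], 1: [1, 2, 3], 2: [4, 5, 6]})

rows = df.copy()
rows[rows[0] > 50] = np.nan      # boolean Series mask: the whole matching row becomes NaN

cells = df.copy()
cells[cells[[0]] > 50] = np.nan  # boolean DataFrame mask: only the matching cell in column 0 changes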


Thanks to all for your help.

Chernyavski.aa