I have a dataframe with 320k rows and 450 columns. I also have several lists of column numbers:
list1 = [1,3,5,...]
list2 = [4,9,...]
...
My goal is to replace certain values in the columns from each list, and then save the result:
df[df[list1] > 7] = np.nan
df[df[list2] > 90] = np.nan
...
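Since all these replacements have the same shape, I think of them as (columns, threshold) pairs. A tiny self-contained sketch of the pattern (the frame, the column lists and the thresholds here are made-up placeholders, not my real data):

import numpy as np
import pandas as pd

# Made-up stand-in for the real frame; integer column labels,
# as they are after reading the CSV with header=None.
df = pd.DataFrame({1: [1, 8, 3], 3: [95, 10, 50], 5: [2, 2, 2]})

# Placeholder (columns, threshold) pairs; the real lists are much longer.
rules = [([1], 7), ([3], 90)]

for cols, threshold in rules:
    # In this group of columns, set every value above the threshold to NaN.
    df[df[cols] > threshold] = np.nan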
The size of the dataframe forced me to do this in chunks:
for chunk in pd.read_csv(filePrev, chunksize=10000, header=None):
    chunk[chunk[list1] >= 7] = np.nan
    chunk[chunk[list2] >= 90] = np.nan
    ...
    chunk.to_csv(newFile, mode='a', header=False, index=False)
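For completeness, here is a minimal, self-contained version of this step (the file paths are placeholders, and the column lists and thresholds are shortened; the real script has many more lists):

import numpy as np
import pandas as pd

filePrev = 'prev_data.csv'   # placeholder path to the source 320k x 450 CSV
newFile = 'new_data.csv'     # placeholder path for the processed output

list1 = [1, 3, 5]            # shortened placeholder column lists
list2 = [4, 9]

for chunk in pd.read_csv(filePrev, chunksize=10000, header=None):
    # Replace values above each group's threshold with NaN.
    chunk[chunk[list1] >= 7] = np.nan
    chunk[chunk[list2] >= 90] = np.nan
    # Append the processed chunk to the output file.
    chunk.to_csv(newFile, mode='a', header=False, index=False)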
But this does not work correctly: I've run this code many times, and most of the time it couldn't finish (MemoryError in IPython, or the application was simply closed by Windows), regardless of the chunksize value. And when it did finish, it had replaced all the values with NaN in most of the rows, and only in some rows was everything replaced correctly.
I've tried the same logic on a small piece of the same dataset, and there it works correctly!
In [11]: df = pd.read_csv(filePrev,nrows=5,usecols=[1,2,3,4,5,6,7],header=None)
In [12]: df
Out[12]:
   1  2  3  4  5  6  7
0  1  1  1  1  1  1  1
1  3  1  1  1  2  1  1
2  3  1  1  1  1  1  1
3  3  1  1  1  2  1  2
4  3  1  1  1  1  1  1
In [13]: list = [1,7]
In [14]: df[df[list] > 1] = np.nan
In [15]: df
Out[15]:
     1  2  3  4  5  6    7
0    1  1  1  1  1  1    1
1  NaN  1  1  1  2  1    1
2  NaN  1  1  1  1  1    1
3  NaN  1  1  1  2  1  NaN
4  NaN  1  1  1  1  1    1
So, any ideas about this? Can it be implemented in "chunking" mode, or is there another way (if so, I'd need an example)? I just want to replace certain values with NaN... :)