This is my first question on Stack Overflow, after struggling for an entire day with this issue.
When loading a large .csv file using pandas.read_csv() with the chunksize option, I get inconsistent results, as if the iterations of the loop were not independent of the data read on earlier iterations. Moreover, the data is read and processed correctly only on the first iteration. Here's a simplified example I created that shows this:
import pandas as pd
import numpy as np
a = pd.DataFrame(np.random.randn(500, 1), columns=list('A'))
b = pd.DataFrame(np.random.randn(500, 1), columns=list('B'))
c = pd.DataFrame(np.random.randn(500, 1), columns=list('C'))
c.to_csv("./c.csv", index=False, sep="\t")
i = 1
# Read the file back in chunks of 200 rows and prepend a and b to column C.
for data in pd.read_csv("./c.csv", delimiter='\t', chunksize=200):
    print("\n\nIteration No.:" + str(i))
    print("First five elements of data before concatenation: \n" + repr(data.loc[:5,'C']))
    print("First element of a: " + str(a['A'][0]) + ". Type:" + repr(type(a['A'][0])))
    print("First element of b: " + str(b['B'][0]) + ". Type:" + repr(type(b['B'][0])))
    print("First element of data: " + str(data['C'].iloc[0]) + ". Type:" + repr(type(data['C'].iloc[0])))
    data['C'] = a['A'].map(str) + b['B'].map(str) + data['C'].map(str)
    print("\n\nFirst five elements of data after concatenation: \n" + repr(data.loc[:5,'C']))
    i += 1  # advance the iteration counter shown in the output
The output of that snippet is the following:
Iteration No.:1
First five elements of data before concatenation:
0 0.272127
1 1.702455
2 0.073175
3 -1.415413
4 0.023546
5 -0.706802
Name: C, dtype: float64
First element of a: -1.28607575146. Type:<class 'numpy.float64'>
First element of b: 0.778682866114. Type:<class 'numpy.float64'>
First element of data: 0.27212690258. Type:<class 'numpy.float64'>
First five elements of data after concatenation:
0 -1.28607575145810920.77868286611354330.2721269...
1 0.242774791222815281.29275536671509881.7024547...
2 0.4524774082028631-1.17833662685619570.0731746...
3 1.4351094358436494-0.5173279482942412-1.415413...
4 -1.7578744077531847-1.59454228118368470.023546...
5 -0.50656599412173-0.3809749686364225-0.7068022...
Name: C, dtype: object
Iteration No.:2
First five elements of data before concatenation:
Series([], Name: C, dtype: float64)
First element of a: -1.28607575146. Type:<class 'numpy.float64'>
First element of b: 0.778682866114. Type:<class 'numpy.float64'>
First element of data: 0.995788479453. Type:<class 'numpy.float64'>
First five elements of data after concatenation:
Series([], Name: C, dtype: object)
Iteration No.:3
First five elements of data before concatenation:
Series([], Name: C, dtype: float64)
First element of a: -1.28607575146. Type:<class 'numpy.float64'>
First element of b: 0.778682866114. Type:<class 'numpy.float64'>
First element of data: -0.188555175182. Type:<class 'numpy.float64'>
First five elements of data after concatenation:
Series([], Name: C, dtype: object)
As you can see, data.loc[:5, 'C'] yields an empty Series on the second and third iterations, while data['C'].iloc[0] yields a value on every iteration.
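To narrow this down, a smaller check that records each chunk's index labels next to the two kinds of lookup might look like the sketch below. It is only a sketch under a few assumptions: the CSV is built in memory with io.StringIO instead of being written to ./c.csv, the random data is seeded, and the helper name chunk_summaries is hypothetical (it is not part of pandas).

```python
import io

import numpy as np
import pandas as pd


def chunk_summaries(n_rows=500, chunksize=200):
    """Read an in-memory CSV in chunks and summarise each chunk's index.

    Hypothetical helper for illustration only; not part of pandas.
    """
    rng = np.random.RandomState(0)  # seeded so runs are repeatable
    frame = pd.DataFrame(rng.randn(n_rows, 1), columns=["C"])
    # Assumption: an in-memory buffer stands in for the file on disk.
    buffer = io.StringIO(frame.to_csv(index=False, sep="\t"))

    summaries = []
    for data in pd.read_csv(buffer, delimiter="\t", chunksize=chunksize):
        summaries.append({
            "first_label": int(data.index[0]),
            "last_label": int(data.index[-1]),
            # Label-based slice: selects rows whose index label is <= 5.
            "rows_by_label": len(data.loc[:5, "C"]),
            # Position-based slice: selects the first six rows of the chunk.
            "rows_by_position": len(data["C"].iloc[:6]),
        })
    return summaries


if __name__ == "__main__":
    for number, summary in enumerate(chunk_summaries(), start=1):
        print(number, summary)
```

If each chunk keeps the row labels of the whole file rather than restarting at 0, that would be consistent with what I see above: the label-based data.loc[:5, 'C'] finds nothing in later chunks, while the position-based data['C'].iloc[0] still returns a value.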
I've tried upgrading pandas to the latest version (0.19.2) on Python 3.5.3. I've also tried downgrading to Python 2.7.12 with pandas 0.19.0, but no dice.
Any help will be greatly appreciated. Thank you very much in advance!