
This is my first question on Stack Overflow, after struggling with this issue for an entire day.
When loading a large .csv file using `pandas.read_csv()` with the `chunksize` option, I get inconsistent results, as if the iterations were not independent of one another with respect to the data read on each pass. In fact, the data is read and processed correctly only on the first iteration. Here's a simplified example I created that shows this:

    import pandas as pd
    import numpy as np

    a = pd.DataFrame(np.random.randn(500, 1), columns=list('A'))
    b = pd.DataFrame(np.random.randn(500, 1), columns=list('B'))
    c = pd.DataFrame(np.random.randn(500, 1), columns=list('C'))
    c.to_csv("./c.csv", index=False, sep="\t")

    i = 1

    for data in pd.read_csv("./c.csv", delimiter='\t', chunksize=200):

        print("\n\nIteration No.:" + str(i))
        print("First five elements of data before concatenation: \n" + repr(data.loc[:5, 'C']))

        print("First element of a: " + str(a['A'][0]) + ". Type:" + repr(type(a['A'][0])))
        print("First element of b: " + str(b['B'][0]) + ". Type:" + repr(type(b['B'][0])))
        print("First element of data: " + str(data['C'].iloc[0]) + ". Type:" + repr(type(data['C'].iloc[0])))

        data['C'] = a['A'].map(str) + b['B'].map(str) + data['C'].map(str)
        print("\n\nFirst five elements of data after concatenation: \n" + repr(data.loc[:5, 'C']))
        i += 1

The output of that snippet is the following:

   Iteration No.:1
   First five elements of data before concatenation: 
   0    0.272127
   1    1.702455
   2    0.073175
   3   -1.415413
   4    0.023546
   5   -0.706802
   Name: C, dtype: float64
   First element of a: -1.28607575146. Type:<class 'numpy.float64'>
   First element of b: 0.778682866114. Type:<class 'numpy.float64'>
   First element of data: 0.27212690258. Type:<class 'numpy.float64'>


   First five elements of data after concatenation: 
   0    -1.28607575145810920.77868286611354330.2721269...
   1    0.242774791222815281.29275536671509881.7024547...
   2    0.4524774082028631-1.17833662685619570.0731746...
   3    1.4351094358436494-0.5173279482942412-1.415413...
   4    -1.7578744077531847-1.59454228118368470.023546...
   5    -0.50656599412173-0.3809749686364225-0.7068022...
   Name: C, dtype: object


   Iteration No.:2
   First five elements of data before concatenation: 
   Series([], Name: C, dtype: float64)
   First element of a: -1.28607575146. Type:<class 'numpy.float64'>
   First element of b: 0.778682866114. Type:<class 'numpy.float64'>
   First element of data: 0.995788479453. Type:<class 'numpy.float64'>


   First five elements of data after concatenation: 
   Series([], Name: C, dtype: object)


   Iteration No.:3
   First five elements of data before concatenation: 
   Series([], Name: C, dtype: float64)
   First element of a: -1.28607575146. Type:<class 'numpy.float64'>
   First element of b: 0.778682866114. Type:<class 'numpy.float64'>
   First element of data: -0.188555175182. Type:<class 'numpy.float64'>


   First five elements of data after concatenation: 
   Series([], Name: C, dtype: object)           

As you can see, `data.loc[:5, 'C']` yields an empty series on the second and third iterations, while `data['C'].iloc[0]` always yields a non-empty value.
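
For reference, here is a minimal check along the lines of the snippet above (it assumes c.csv has already been written) that prints the index range each chunk carries, which may help show what `data.loc[:5, 'C']` is actually selecting:

    import pandas as pd

    # Minimal sketch: inspect the index of each chunk produced by read_csv().
    # Assumes c.csv was written as in the snippet above (500 rows, column 'C').
    for chunk in pd.read_csv("./c.csv", delimiter='\t', chunksize=200):
        # Each chunk keeps the row labels of the original file, so the second
        # chunk is labelled 200-399 rather than 0-199.
        print(chunk.index.min(), chunk.index.max(), len(chunk))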

I've tried upgrading pandas to the latest version (0.19.2) on Python 3.5.3. I've also downgraded to Python 2.7.12 with pandas 0.19.0, but no luck either.

Any help will be greatly appreciated. Thank you very much in advance!

  • `read_csv()` is usually one of the faster options out there. how big is your file and have you tried running it without `chunksize`? also [this](http://stackoverflow.com/questions/33642951/python-using-pandas-structures-with-large-csviterate-and-chunksize) may help you. Might be your problem. Essentially, you might have to `concat` your chunks. – MattR Mar 08 '17 at 21:11
  • What are the inconsistent results? Try replacing `data.loc[:5, 'C']` with `data.iloc[:5]['C']`? Using `data.loc[:5, 'C']` returns an empty DataFrame on the second and third iteration because pandas keeps track of the preceding index, so the index of `data` at second iteration starts with 200. – ostrokach Mar 08 '17 at 21:18
  • @ostrokach I've updated the post with the output and your suggestion fixes the issue in the example. However, in my real life case, I still get NaN as an output of the string concatenation. – Diego Llarrull Mar 09 '17 at 12:57
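
Based on ostrokach's comment above, here is a rough sketch of how the loop could use positional indexing with `iloc` to peek at each chunk, while leaving the chunk's original index intact so that the index-aligned concatenation with `a['A']` and `b['B']` still pairs rows by their original labels (it assumes `a`, `b` and c.csv from the question's snippet):

    import pandas as pd

    # Sketch based on ostrokach's comment: .loc[:5, 'C'] is label-based, and
    # later chunks carry labels starting at 200, so that selection comes back
    # empty. Positional indexing with .iloc always looks at the first rows of
    # the current chunk. Assumes a, b and c.csv from the question's snippet.
    for data in pd.read_csv("./c.csv", delimiter='\t', chunksize=200):
        print(data.iloc[:5]['C'])    # first five rows of this chunk, by position
        # Index alignment still pairs chunk rows 200-399 with rows 200-399 of a and b:
        data['C'] = a['A'].map(str) + b['B'].map(str) + data['C'].map(str)

If the real data still produces NaN after this, one possibility worth checking is whether `a` and `b` actually contain every index label that the chunks carry, since index-aligned string addition yields NaN for any label missing on either side.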
