
I have a series of .csv files that I'm reading with pandas.read_csv. Out of many columns, I only read two (the 2nd and the 15th).

    import glob
    import pandas as pd

    datafiles = glob.glob(mypath)
    for dfile in datafiles:
        data = pd.read_csv(dfile, header=6, usecols=['Reading', 'Value'])

Each CSV has a few lines of header at the top. Every once in a while, pandas reads one of the numbers as a NaN. Excel has no trouble reading these values, and on visual inspection of the file I can't see what causes the problem. Specifically, in this case, the row indexed as 265 in the file (263 in the data frame) reads a NaN in the 'Value' column when it should be ~27.4.

    >>> data['Value'][264]
    nan

This problem is consistent and doesn't change with the number of files I read. Many of the files don't have the problem at all; in the rest, only a single random number is read as a NaN, in either of the two columns. I've tried changing the dtype from the default float64 to np.float128, but this doesn't fix it. Any ideas on how to fix this?

Update: A grep search shows that the line-ending character is ^M (carriage return), with only 4 exceptions: lines at the beginning of every file, before the header. On further inspection, this specific point [264] is treated differently across the failing files: in 5/12 files it's fine, in 2/12 files it's read as 27.0, in 3/12 as nan, and in 2/12 as 2.0. One of the files (one that reads as 27.0) is available for download here
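As a quick sanity check, something like the helper below (`find_nul_bytes` is a name I made up for illustration, not part of the original post) can scan a file in binary mode and report the offsets of any NUL bytes, which would confirm what the grep results hint at:

```python
import glob


def find_nul_bytes(path):
    """Return the byte offsets of any NUL (0x00) bytes in a file."""
    with open(path, 'rb') as f:
        data = f.read()
    return [i for i, byte in enumerate(data) if byte == 0]


# Scan every file matched by mypath and report the offenders:
# for dfile in glob.glob(mypath):
#     offsets = find_nul_bytes(dfile)
#     if offsets:
#         print(dfile, offsets)
```

Any non-empty result pins down exactly which rows are affected, so they can be compared against the rows pandas turns into NaN.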

k2r
1 Answer


It looks like you have stray null characters scattered through your csv files, and they are what's causing the problem. The fix is to strip the \0 characters from the text before handing it to pandas.

Here's an example of how to do so. The conditional import is only there so StringIO works on both Python 2 and Python 3, since we load the cleaned text from a string instead of from a file.

    import sys
    if sys.version_info[0] < 3:
        from StringIO import StringIO  # Python 2
    else:
        from io import StringIO        # Python 3

    datafiles = glob.glob(mypath)
    for dfile in datafiles:
        # Read the file, dropping any null characters, then feed the
        # cleaned text to pandas through an in-memory buffer.
        st = ''
        with open(dfile, 'r') as f:
            for line in f:
                st += line.replace('\0', '')
        data = pd.read_csv(StringIO(st), header=6, usecols=['Reading', 'Value'])

It would be cool if pandas had an option to do this automatically when loading data into the DataFrame, but it appears that there is no such option as of now.

Charlie Haley
  • Awesome, thanks so much. I actually just found the null characters myself, but I was looking for a bash command line to remove them, but this will work, too! – k2r Feb 18 '16 at 20:00