
I have a script that reads some measurement data in CSV form, and then does all kinds of plotting and stuff with it.

Now I have a new dataset, where some idiot deemed it helpful to add random comments at the end of some lines, like so:

01.02.1988 00:00:00   ;   204.94     
01.03.1988 00:00:00   ;   204.87 ; something
01.04.1988 00:00:00   ;   205.41     
01.05.1988 00:00:00   ;   205.64 ; something ; something else    
01.06.1988 00:00:00   ;   205.59 ; also something    
01.07.1988 00:00:00   ;   205.24

which gives me a nice

ValueError: Expected 2 fields in line 36, saw 3

and so on.

According to this and this I have to use the names=['whatever','else'] argument when reading it.

But somehow this goes all kinds of wrong. So here are some examples:

CSV file

Stuff
more stuff I dont need
Date;level;crap1;crap2;crap3;crap4;crap5;crap6
01.01.1988 00:00:00   ;   204.87     
01.02.1988 00:00:00   ;   204.94     
01.03.1988 00:00:00   ;   204.87 

The "nice" header is obviously "handmade", but I should just be able to skip it!?

CSV reader

ValReader = pd.read_csv(csv_list[counter], sep=r'\s*;', skiprows=DateStart,
                        names=['Date', 'level', 'crap1', 'crap2', 'crap3',
                               'crap4', 'crap5', 'crap6'],
                        usecols=['Date', 'level'], index_col='Date',
                        dayfirst=True, parse_dates=True)

What I get

print ValReader
                         level
Date                          
Date                     level
01.04.2003 00:00:00     200.76
01.05.2003 00:00:00     200.64
01.06.2003 00:00:00     200.53

Which, in turn, causes level to get handled as a string.
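
Here is a minimal sketch of what I think is happening: names= is passed while the manual header line is still in the file, so pandas treats the header as a data row and the whole column comes out as strings (the file and names here are cut down for illustration, not my actual script):

import io
import pandas as pd

s = io.StringIO(u'''Date;level
01.01.1988 00:00:00;204.87''')

# passing names= implies header=None, so the 'Date;level' line
# is read as the first data row instead of as the header
df = pd.read_csv(s, sep=';', names=['Date', 'level'], index_col='Date')

print df['level'].dtype  # object, i.e. strings, because of the stray 'level' row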

OK, easy: the manual header line in the CSV (which worked fine in a previous version that only had to handle good data) is the culprit. So I just set skiprows to skiprows=DateStart+1, but that results in

ValueError: Number of passed names did not match number of header fields in the file

So obviously I got utterly lost in how pandas handles the names and positions of columns.

JC_CL

1 Answer


I used to have this issue as well, but here is a solution.

One way to resolve it is to NOT use a regex to parse the separator, as that falls back to the Python engine. With the C engine you can skip the bad lines (with just a warning), and you can specify which columns you want.

For example:

In [1]: import io

In [2]: import pandas as pd

In [3]: s = io.StringIO(u'''Stuff
more stuff I dont need
Date;level;crap1;crap2;crap3;crap4;crap5;crap6
01.01.1988 00:00:00   ;   204.87
01.02.1988 00:00:00   ;   204.94
01.03.1988 00:00:00   ;   204.87 ''')
# I use skiprows=2 instead of DateStart here
# after setting error_bad_lines=False, you can parse the csv OK...
In [4]: ValReader = pd.read_csv(s, sep=';', skiprows=2, usecols=['Date', 'level'], 
                                index_col='Date', dayfirst=True, parse_dates=True, 
                                error_bad_lines=False)

In [5]: ValReader
Out[5]:
             level
Date
1988-01-01  204.87
1988-02-01  204.94
1988-03-01  204.87

In [6]: ValReader['level'].dtype
Out[6]: dtype('float64')
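
Side note: on newer pandas (1.3 and later) error_bad_lines is deprecated, and it is removed in 2.0; the replacement is on_bad_lines='skip'. The same call would look like this (re-create the buffer or s.seek(0) first if you re-run it in the same session):

ValReader = pd.read_csv(s, sep=';', skiprows=2, usecols=['Date', 'level'],
                        index_col='Date', dayfirst=True, parse_dates=True,
                        on_bad_lines='skip')  # replaces error_bad_lines=False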

Hope this helps with the issues you have.

Anzel
  • I forgot why exactly I used regex, but I remember having some issues with the various spaces as padding before and after the `;`. But it seems to work. However, now it seems to ignore the `na_values='hole'` that I added, and thus errors out with `ValueError: could not convert string to float: hole`. I will play around a bit more and see what might have caused it. – JC_CL Jul 02 '15 at 16:50
  • @JC_CL, to my understanding, there are usually ways to eliminate the usage of regex in a csv file. For your use case, as long as the required fields do not contain a noisy `;`, you can parse the data as is and cleanse the noise afterwards (see the sketch after this thread). But anyway, hope you can figure that out and good luck :) – Anzel Jul 03 '15 at 07:08
  • I am not sure what really changed in my data, or what else I changed in my script, but for now the regexless `sep` seems to work. I couldn't figure out what the hell happened to my `na_values`, so I did it the brutal way and changed `hole` to `NaN` with sed in my preprocessing. Not elegant, but it seems to work. Off to the next WTF moment then… – JC_CL Jul 03 '15 at 07:20
  • @JC_CL, glad you have it brute-force solved. As a side note, you can also set `warn_bad_lines=True` <-- I think it's the default. It will show all warning lines when the csv is parsed, so yes, a workaround is to `sed` the source; not elegant, but definitely a way to go. The problem isn't on your end but rather the source **provider**, I am afraid ;) – Anzel Jul 03 '15 at 07:36
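
A likely culprit for the na_values='hole' issue from the comments: with sep=';' and padded fields, the raw value is '   hole', which does not match 'hole' exactly, so na_values never fires. A minimal sketch of the parse-then-cleanse route suggested above (the hole rows and file layout here are made up for illustration):

import io
import pandas as pd

s = io.StringIO(u'''Date;level
01.01.1988 00:00:00   ;   204.87
01.02.1988 00:00:00   ;   hole
01.03.1988 00:00:00   ;   204.87 ; noise''')

# noisy 3-field lines get skipped, everything else is read as-is
df = pd.read_csv(s, sep=';', usecols=['Date', 'level'], index_col='Date',
                 dayfirst=True, parse_dates=True, error_bad_lines=False)

# the padded field is '   hole', not 'hole', so na_values misses it;
# strip the padding and coerce the leftovers to numbers ('hole' -> NaN)
df['level'] = pd.to_numeric(df['level'].str.strip(), errors='coerce')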