
I am doing various analyses and plots on datasets.

I get my datasets as CSV files from a government website. Depending on the type of data, the CSV files differ slightly in:

  • length of the header
  • contents of the header
  • number of columns
  • number of NaN values
  • monthly or daily data
  • magnitude of the values

As a start, I read the first 50 rows with Python's standard CSV reader, to search for various strings that I need later and to find out how long the header actually is.
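For illustration, a minimal sketch of that scan (the marker string 'Date' and the file name are placeholders; the real script searches for more strings than this):

import csv

# Sketch: scan the first 50 rows for the column-header line ('Date;level'),
# whose row number tells read_csv how many header rows to skip.
with open('test2.csv') as f:
    for i, row in enumerate(csv.reader(f, delimiter=';')):
        if i >= 50:
            break
        if row and row[0].strip() == 'Date':
            DataStart = i  # rows to skip so that 'Date;level' becomes the header
            break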

The next step is replacing some things with sed (changing the names of the columns, exchanging , for . as the decimal separator).

Then I read it with

Mydata=pd.read_csv(csv_list[counter],sep=';',skiprows=DataStart,index_col='Date',usecols=[0,1],dayfirst=True,parse_dates=True,na_values='L\xfccke')

And now comes my issue: with the following two CSVs, the first works, and the second doesn't.

This:

...20 more lines of header
Werteformat:               ;1 Nachkommast.
Date;level
01.01.1971 07:00:00   ;     0.0
02.01.1971 07:00:00   ;     0.0
03.01.1971 07:00:00   ;     0.0
...15000 lines of data
01.01.2012 07:00:00   ;Lücke

works, whereas this

...30 more lines of header
Werteformat:               ;2 Nachkommast.
Date;level;
01.01.1970 00:00:00   ;   427.27 ;     
01.02.1970 00:00:00   ;   427.80 ;     
...500 lines of data, with stuff like
01.03.1973 00:00:00   ;Lücke     ;     
in between

errors out when trying to plot a histogram with

Traceback (most recent call last):
  File "plotter.py", line 179, in <module>
    plt.hist(Jan_RV)
  File "/usr/lib64/python2.7/site-packages/matplotlib/pyplot.py", line 2827, in hist
    stacked=stacked, **kwargs)
  File "/usr/lib64/python2.7/site-packages/matplotlib/axes.py", line 8326, in hist
    m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
  File "/usr/lib64/python2.7/site-packages/numpy/lib/function_base.py", line 176, in histogram
    mn, mx = [mi+0.0 for mi in range]
TypeError: cannot concatenate 'str' and 'float' objects

OK, so it seems like something gets read as strings, and by running some parts in ipython, I found out that it is the Lücke part, which they chose as their representation for NaN/missing data. I thought I had taken care of that with the na_values='L\xfccke' in read_csv, and with the first dataset it does work: if I look at MyData in ipython, I see NaN instead of Lücke. With the second dataset, the Lücke stays.

[Edit] The suspicion that the umlaut ü is causing the issue might be wrong: I changed Lücke to NaN by hand, got rid of the na_values='L\xfccke' in read_csv, and it still keeps the text parts (now NaN) as strings. I also have

Rlength=len(MyData)   #counts everything, including NaN
Rcount=MyData.count() #counts only valid numbers
NaN_Number=Rlength-Rcount

in my script, and with the second dataset, it always results in a NaN_Number of 0. [/edit]

What could be the reason for that? Does na_values only take the first value and then stop? Does the fact that I have an empty column in the second dataset cause issues? But that shouldn't matter, since I only take the first two columns, with usecols=[0,1].

Another obvious difference is that the first set is daily data and the second monthly, but that shouldn't matter; I have some other monthly data that does work.

One idea I had was that the CSV files might have different line endings (is that the correct term? Unix vs. Windows, or carriage return vs. line feed) or encodings, but my editor tells me they are the same, and stat filename.csv in the terminal also gives quite similar results.
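If the editor's display leaves any doubt, the raw bytes settle the line-ending question; a quick sketch:

# Read the first line as raw bytes: a trailing '\r\n' means Windows (CRLF)
# line endings, a bare '\n' means Unix (LF).
with open('test2.csv', 'rb') as f:
    print(repr(f.readline()))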

So I am utterly lost.

[Edit 2] Short sample CSV file:

Werteformat:               ;2 Nachkommast.
Date ;level ;
01.10.1982 00:00:00   ;   873.33 ;     
01.11.1982 00:00:00   ;   873.19 ;     
01.12.1982 00:00:00   ;Lücke     ;     
01.01.1983 00:00:00   ;Lücke     ;     
01.02.1983 00:00:00   ;   873.17 ;     

read in with (some changes from @mhawke's answer already included):

Tester2=pd.read_csv('test2.csv',sep=r'\s*;',skiprows=1,index_col='Date',dayfirst=True,parse_dates=True,na_values='Lücke',usecols=[0,1])

results in

In [598]: Tester2
Out[598]: 
                level  Unnamed: 2
Date                             
1982-10-01     873.33         NaN
1982-11-01     873.19         NaN
1982-12-01        NaN         NaN
1983-01-01        NaN         NaN
1983-02-01     873.17         NaN
JC_CL
  • I think the `read_csv` argument `na_values` might *replace* missing values with what you choose, and not guarantee that that string is going to be *interpreted* as NaN. Maybe you could try looking into providing a `converter` function for that particular column that deals with Lücke and missing values in a controlled way explicitly? – vmg Feb 12 '15 at 11:09
  • According to the [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/io.html#na-values), `na_values` accepts as NaN what I tell it to: "`read_csv(path, na_values=["Nope"])` the default values, in addition to the string `"Nope"`, are recognized as `NaN`" – JC_CL Feb 12 '15 at 11:17
  • Well, I just tested, and it seems `na_values='L\xfccke'` won't cause "Lücke" to be interpreted as NaN. It might be an encoding problem. I tried `na_values='Lücke'` and it didn't work either. – vmg Feb 12 '15 at 11:18
  • By encoding problem you mean things like UTF-8 vs. ISO 8859-15? That was my idea too, but it seems like the CSV files are of the same flavour. And with the first file, `L\xfccke` gets recognized correctly (on my machine, which is set to German), replacing the `Lücke` with `NaN`, just as I want it - and expect it - to do. – JC_CL Feb 12 '15 at 11:26
  • Weird. I posted a workaround below. Try `na_values=u'Lücke'` to see if it's something like that. – vmg Feb 12 '15 at 11:27

3 Answers


The values in the level column of the second file include trailing whitespace. This is because the second file has an additional column after level, as denoted by the trailing delimiter in the header and data rows, and so the whitespace is considered part of the field. Consider this file, which is a minimal example of the second file (note the trailing delimiters ;):

Date;level;
01.01.2012 07:00:00   ;Lücke ;

>>> import pandas as pd
>>> data = pd.read_csv('2.csv', sep=';')
>>> data['level'][0]
'L\xc3\xbccke '

The trailing space is included in the value, so na_values must include the space (note the UTF-8 encoding on my system):

>>> data = pd.read_csv('2.csv', sep=';', na_values=['L\xc3\xbccke'])    # no space
>>> data['level'][0]
'L\xc3\xbccke '

>>> data = pd.read_csv('2.csv', sep=';', na_values=['L\xc3\xbccke '])    # with space
>>> data['level'][0]
nan

So I think that's basically the cause of your problem. You could try to specify sep as the regular expression r'\s*;' to remove the trailing whitespace from all columns; this should work for both of your files, as well as other files that might have a different amount of trailing whitespace.

>>> data = pd.read_csv('2.csv', sep=r'\s*;', na_values=['L\xc3\xbccke'])    # no spaces required
>>> data['level'][0]
nan

However:

  1. Whitespace will be stripped from all columns, but this might not be a problem for you.
  2. The Python parser engine will be used instead of the 'c' engine, because the latter doesn't support regex separators. A warning is issued, presumably because the Python engine is slower.

If 2 above is a problem, fixing it is difficult because read_csv() has no option to strip trailing whitespace. You can supply strip() as a function in the converters dict, but this does not affect the matching of na_values. Perhaps you could remove the whitespace when you pre-process the file with sed.
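For illustration, a sketch of such a converter that handles both the whitespace and the marker itself (assuming the file is UTF-8 encoded like the example above; the byte string would need adjusting for other encodings). Because the converter maps the stripped marker to NaN directly, na_values is no longer needed for that column:

import numpy as np
import pandas as pd

LUECKE = 'L\xc3\xbccke'  # 'Lücke' as UTF-8 bytes; adjust to the file's encoding

def parse_level(val):
    # Strip surrounding whitespace, map the 'Lücke' marker (and empty
    # fields) to NaN, and convert everything else to float.
    val = val.strip()
    if val == LUECKE or not val:
        return np.nan
    return float(val)

data = pd.read_csv('2.csv', sep=';', converters={'level': parse_level})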

Also, you may need to ensure that the character encoding of the strings in na_values matches that of your data files. You can play with the encoding parameter if required.
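For example, something like this might work for a Latin-1 file (a guess, since I don't know your files' actual encoding):

# Hypothetical: let pandas decode the file, then match the marker as unicode
data = pd.read_csv('2.csv', sep=r'\s*;', encoding='latin-1',
                   na_values=[u'L\xfccke'])  # u'Lücke'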

mhawke
  • That seems like it might be the solution I am looking for (I hadn't expected that trailing whitespace could be an issue, since it works perfectly with the date field in the first file)! Currently trying the `sep=r'\s*;'` part. It works with the first file, but my second still fails. Does it expect that whitespace is always consistent throughout the file? Because reading in the second file like this results in `Data columns: level 316 non-null values Unnamed: 2 0 non-null values dtypes: float64(2)`, so I suppose my first line `Date;level;` without the spaces causes issues. – JC_CL Feb 12 '15 at 13:06
  • Trailing whitespace of any length is handled because the regex `\s*;` matches zero or more whitespace characters followed by a semicolon. – mhawke Feb 12 '15 at 13:12
  • OK, I see. The first file (without whitespace, and only two columns) works, but the second gets read in with three columns, even though I specified `usecols=[0,1]`. And the third column is now `Unnamed: 2 0 non-null values`, which causes all kinds of mayhem in my script. I guess I should find some preprocessing that gets rid of the empty third column? – JC_CL Feb 12 '15 at 13:16
  • To your question could you add a minimal sample CSV file that causes the problem you see? – mhawke Feb 12 '15 at 13:20
  • I have edited a file with a minimal header and its results into the question. – JC_CL Feb 12 '15 at 13:28
  • I don't have the problem with the minimal data: the output is as expected, without any additional column added. I do notice that there is extra whitespace trailing the final `;`, but that shouldn't cause any extra columns if you specify `usecols=[0,1]`. What version of pandas are you using? I have 0.14.0. – mhawke Feb 12 '15 at 13:47
  • I am on the somewhat old pandas 0.10.0. It seems like there are some bugs in usecols. I fixed it with your helpful answer and a workaround (see my answer below). – JC_CL Feb 12 '15 at 13:54

The answer by @mhawke solved the first part of my issue.

The remaining issue, a third column full of empty data that causes my code to run amok, could be caused by the old pandas version shipped by Fedora.

According to this question, usecols used to have some bugs, and I am still on pandas 0.10, so I am probably affected: usecols=[0,1], which should rid me of my third, empty column, gets ignored.

My workaround is a simple del MyData['Unnamed: 2'] that gets triggered if my header analysis finds out that I have those dreaded three columns.

So far, with a test dataset of 15 CSV files, it works, but I am kinda worried about hardcoding 'Unnamed: 2' as the column name. I don't yet understand why and how that column got that name.
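If the name follows the usual pandas pattern of calling columns with an empty header field 'Unnamed: <position>' (which would explain it, given the trailing ; in the header row), a slightly less hardcoded variant of the workaround would be a sketch like this (untested on pandas 0.10):

# Sketch: drop every column pandas auto-named from an empty header field,
# instead of hardcoding 'Unnamed: 2'.
for col in list(MyData.columns):
    if str(col).startswith('Unnamed:'):
        del MyData[col]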

I am no expert (who would have guessed that…), but it somehow doesn't really feel like a good solution.

JC_CL

I'd also like to understand the issue here (which I suspect has to do with the ü encoding causing Lücke not to be correctly interpreted as NaN), but here's a workaround.

Here's my "test.csv" file:

A;B;C
1;2;;3
4;5;Lücke;6

Here's my code:

import pandas as pd

def lucke_to_zero(val):
    # Map the 'Lücke' marker and empty fields to 0; pass other values through.
    if val == 'Lücke':
        return 0
    if not val:
        return 0
    return val

mdict = {'B': lucke_to_zero}
frame = pd.read_csv("test.csv", sep=';', converters=mdict)
frame.head()

Results in the expected dataframe:

   A  B  C
1  2  0  3
4  5  0  6
vmg
  • Seems like I won't need your workaround, since with your test.csv, a simple `frametest=pd.read_csv('test.csv',sep=';',na_values='Lücke')` in ipython works as expected, resulting in `In [572]: frametest Out[572]: A B C 1 2 NaN 3 4 5 NaN 6`. However my `L\xfccke` now ignores the `Lücke`. And using `Lücke` in my main code still doesn't work, even though I declared `#coding: utf-8` now at the beginning of my file. – JC_CL Feb 12 '15 at 11:43
  • And your workaround works with your test data, but when I try it on my real data, it ignores the `Lücke`. It seems like it is not an encoding problem, since when I replace `Lücke` with `NaN` in the original file by hand, it still causes the same error. (see edits in my question) – JC_CL Feb 12 '15 at 11:50
  • I think *I* must have an encoding setting going wrong, because `na_values='Lücke'` does not work for me unless I define utf-8 explicitly! This was a very interesting problem. – vmg Feb 12 '15 at 14:45
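(Presumably that means declaring the source-file encoding so that the 'Lücke' literal contains the right bytes; a guess at what that looks like:)

# -*- coding: utf-8 -*-
import pandas as pd

# With the source encoding declared, the 'Lücke' literal is UTF-8 bytes and
# matches the marker in a UTF-8 encoded CSV file.
frame = pd.read_csv('test.csv', sep=';', na_values=['Lücke'])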