I am doing various analyses and plots on datasets.
I get my datasets as CSV files from a government website. depending on the type of data, the CSV files are slightly different:
- length of the header
- contents of the header
- number of columns
- number of NaN values
- montly or daily data
- magnitude of the values
As a start, I read the first 50 rows with pythons standard CSV reader, to search for various strings that I need later and to find out how long the header actually is.
The next step is replacing some things with sed (changing the names for the columns, exchanging ,
with .
as the decimal).
Then I read it with
Mydata=pd.read_csv(csv_list[counter],sep=';',skiprows=DataStart,index_col='Date',usecols=0,1],dayfirst=True,parse_dates=True,na_values='L\xfccke')
And now comes my Issue - with the following two CSV, the first works, and the second doesn't:
This:
...20 more lines of header
Werteformat: ;1 Nachkommast.
Date;level
01.01.1971 07:00:00 ; 0.0
02.01.1971 07:00:00 ; 0.0
03.01.1971 07:00:00 ; 0.0
...15000 lines of data
01.01.2012 07:00:00 ;Lücke
works, whereas this
...30 more lines of header
Werteformat: ;2 Nachkommast.
Date;level;
01.01.1970 00:00:00 ; 427.27 ;
01.02.1970 00:00:00 ; 427.80 ;
...500 lines of data, with stuff like
01.03.1973 00:00:00 ;Lücke ;
in between
errors out when trying to plot a histogram with
Traceback (most recent call last):
File "plotter.py", line 179, in <module>
plt.hist(Jan_RV)
File "/usr/lib64/python2.7/site-packages/matplotlib/pyplot.py", line 2827, in hist
stacked=stacked, **kwargs)
File "/usr/lib64/python2.7/site-packages/matplotlib/axes.py", line 8326, in hist
m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
File "/usr/lib64/python2.7/site-packages/numpy/lib/function_base.py", line 176, in histogram
mn, mx = [mi+0.0 for mi in range]
TypeError: cannot concatenate 'str' and 'float' objects
OK, so it seems like something gets read as strings, and by running some parts in ipython, I found out that it is the Lücke
part, that they choose to use as NaN
or missing data
representer. I thought I had taken care of that with the na_values='L\xfccke'
in read_csv
, and with the first dataset, it does work. If I look at MyData
in ipython, I see NaN
instead of Lücke
, whereas with the second dataset, the Lücke
stays.
[Edit] The suspicion that the umlaut ü
is causing the issue might be wrong? I changed Lücke
to NaN
by hand, and got rid of the na_values='L\xfccke'
in read_csv
and it still keeps the text parts (now NaN) as strings. I also have
Rlength=len(MyData) #counts everything, including NaN
Rcount=MyData.count() #counts only valid numbers
NaN_Number=Rlength-Rcount
in my script, and with the second dataset, it always results in 0 NaN_Numbers [/edit]
What could be the reason for that? Does na_values
only take the first value, and then stops? Does the fact that i have an empty column in the second dataset cause issues? But that shouldn't matter, since I only take the first two, with usecols=[0,1]
.
Another obvious difference is the fact that the first set is daily data, and the second monthly, but that shouldnt matter. I have some other monthly data, that does work.
One Idea I had, was that the CSV files might have different line endings (Is that the correct term? Unix v. Windows, or carriage return v. line feed) or encodings, but looking at that in my editor, it tells me they are the same. And stat filename.csv
in terminal, also gets me quite similar results.
So I am utterly lost.
Edit2:* Short sample CSV file:
Werteformat: ;2 Nachkommast.
Date ;level ;
01.10.1982 00:00:00 ; 873.33 ;
01.11.1982 00:00:00 ; 873.19 ;
01.12.1982 00:00:00 ;Lücke ;
01.01.1983 00:00:00 ;Lücke ;
01.02.1983 00:00:00 ; 873.17 ;
read in with (Some changes from @mhawke's answer already included):
Tester2=pd.read_csv('test2.csv',sep=r'\s*;',skiprows=1,index_col='Date',dayfirst=True,parse_dates=True,na_values='Lücke',usecols=[0,1])
results in
In [598]: Tester2
Out[598]:
level Unnamed: 2
Date
1982-10-01 873.33 NaN
1982-11-01 873.19 NaN
1982-12-01 NaN NaN
1983-01-01 NaN NaN
1983-02-01 873.17 NaN