I have a rather annoyingly formatted CSV file, and I end up with the date and time split into two columns. I want to combine them into one datetime index. I have figured out a way to do it, but it's rather slow:
MWE:
CSV example:
21.03.1968 07:01:00 661.79999   "I,"
25.03.1968 07:01:00 661.77002
28.03.1968 07:01:00 661.75   "I,"
01.04.1968 07:01:00 661.72998
04.04.1968 07:01:00 661.1   "I,"
08.04.1968 07:01:00 661.70001
As you can see, it uses a single space as the delimiter (and three spaces before the last column), and it's (euro-style) time series data. I'm reading it with
df = pd.read_csv('file.csv',
                 delim_whitespace=True,  # should use spaces as delimiter
                 header=0, names=['year', 'hours', 'data'],
                 usecols=[0, 1, 2])      # gets rid of the weird "I,"
and end up with a df with a default 0, 1, 2, 3... index and year, hours and data columns. As this is a time series, I obviously want to turn year and hours into a datetime index.
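For a self-contained version of the MWE, something like this should reproduce the setup from the sample rows above (a sketch using io.StringIO instead of a file, and header=None because the pasted sample has no header row; the real files apparently do, hence header=0 above):

import io
import pandas as pd

# Sample rows copied from above; the real data comes from 'file.csv'
sample = (
    '21.03.1968 07:01:00 661.79999   "I,"\n'
    '25.03.1968 07:01:00 661.77002\n'
    '28.03.1968 07:01:00 661.75   "I,"\n'
    '01.04.1968 07:01:00 661.72998\n'
    '04.04.1968 07:01:00 661.1   "I,"\n'
    '08.04.1968 07:01:00 661.70001\n'
)

df = pd.read_csv(io.StringIO(sample),
                 delim_whitespace=True,            # split on any run of whitespace
                 header=None,                      # the pasted sample has no header row
                 names=['year', 'hours', 'data'],
                 usecols=[0, 1, 2])                # drops the weird "I," column
print(df)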
My first instinct was to just brute force it, by looping over the rows and messing around with the strings, but as we all know, that's not the best way. So with some quick searching, I found the seemingly obvious way:
df['date'] = pd.to_datetime(df['year'] + ' ' + df['hours'], dayfirst=True)

and then

df = df.set_index('date')
df = df.drop(['year', 'hours'], axis=1)
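Wrapped up, the slow step looks roughly like this (just a sketch of the lines above collected into one arbitrarily named function, so the whole thing can be timed in one go):

def make_datetime_index(df):
    # Concatenate the date and time strings and let pandas parse them;
    # dayfirst=True because the dates are dd.mm.yyyy.
    df['date'] = pd.to_datetime(df['year'] + ' ' + df['hours'], dayfirst=True)
    df = df.set_index('date')
    return df.drop(['year', 'hours'], axis=1)

df = make_datetime_index(df)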
This works perfectly fine; however, with my real data (the test file has 110506 rows, and I've got approx. 880 files in total), %timeit tells me 7.96 s ± 95.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each), so I'm looking at around two hours for the whole deal!
So, is there a (significantly) faster way of combining two dataframe columns into one datetime index?
It would be great to just define fixed column lengths in read_csv, but that doesn't seem to be possible. I'm always amazed at the thousands of easy and blazingly fast comfort functions built into pandas, but if one exists for this (rather common, I'd have thought) case, I haven't found it yet. read_fwf, suggested in a comment by Quang Hoang, seemed promising, but as I just learned, my data field has varying widths! So the data column always ends up with strings like 661.1   " in what should be floats (see the sketch below). Maybe in the linked apply answer? But I can't figure it out.
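For reference, the read_fwf attempt looked roughly like this (the colspecs below are guesses based on the sample rows; the exact boundaries don't matter much, because whatever I pick, the varying width of the data field breaks it):

import io
import pandas as pd

# Two of the sample rows from above; the data field is 9 characters wide
# in one row and only 5 in the other, so any fixed colspec is wrong for one of them.
sample = ('04.04.1968 07:01:00 661.1   "I,"\n'
          '08.04.1968 07:01:00 661.70001\n')

df = pd.read_fwf(io.StringIO(sample),
                 colspecs=[(0, 10), (11, 19), (20, 29)],  # guessed column boundaries
                 header=None,
                 names=['year', 'hours', 'data'])
print(df['data'])
# The short 661.1 row picks up trailing spaces and the opening quote,
# so the column comes back as object (strings) instead of float.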