
I have a rather annoyingly formatted CSV file, and I end up with time and date split into two columns. I want to combine them into one datetime index. I have figured out a way to do it, but it's rather slow:

MWE:

CSV example:

21.03.1968 07:01:00 661.79999   "I,"
25.03.1968 07:01:00 661.77002   
28.03.1968 07:01:00 661.75   "I,"
01.04.1968 07:01:00 661.72998
04.04.1968 07:01:00 661.1   "I,"
08.04.1968 07:01:00 661.70001

As you can see, it uses a single space as delimiter (and three spaces before the last column), and it's (euro-style) time series data. I'm reading it with

import pandas as pd

df = pd.read_csv('file.csv', delim_whitespace=True,  # should use whitespace as delimiter
                 header=0, names=['year', 'hours', 'data'],
                 usecols=[0, 1, 2])  # gets rid of the weird "I,"

and end up with a df with a default 0, 1, 2, 3... index and year, hours and data columns. As this is a time series, I obviously want to turn year and hours into a datetime index.

My first instinct was to just brute-force it by looping over the rows and messing around with the strings, but as we all know, that's not the best way. So with some quick searching, I found the seemingly obvious way:

df['date'] = pd.to_datetime(df['year'] +' '+df['hours'], dayfirst=True)

and then df = df.set_index('date') followed by df = df.drop(['year', 'hours'], axis=1).
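Written out in one piece, the whole (slow) pipeline currently looks like this (same steps as above, with the drop reassigned so it actually sticks):

import pandas as pd

df = pd.read_csv('file.csv', delim_whitespace=True,
                 header=0, names=['year', 'hours', 'data'],
                 usecols=[0, 1, 2])
# combine the two string columns, parse them once, and use the result as index
df['date'] = pd.to_datetime(df['year'] + ' ' + df['hours'], dayfirst=True)
df = df.set_index('date')
df = df.drop(['year', 'hours'], axis=1)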

This works perfectly fine; however, with my real data (the test file has 110506 rows, and I've got approx. 880 files in total), %timeit tells me 7.96 s ± 95.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each), so I'm looking at around two hours for the whole deal!

So, is there a (significantly) faster way of combining two dataframe columns into one datetime index?

It would be great to just define fixed column widths in read_csv, but that doesn't seem to be possible. I'm always amazed by the thousands of easy and blazingly fast comfort functions built into pandas, but if one exists for this (rather common, I'd have thought) case, I haven't found it yet. read_fwf, suggested by Quang Hoang in the comments, seemed good, but as I just learned, my data field has varying widths! So my data column always ends up with stray strings like `661.1   "` in what should be floats. Maybe the linked apply answer would work? But I can't figure it out.
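For reference, this is roughly what I tried with read_fwf (the column positions here are guesses based on the sample above, which is exactly where it falls apart, since the data field's width varies):

import pandas as pd

# colspecs are guessed from the sample rows; the data column is not
# actually fixed-width, so the last interval doesn't hold for all rows
df = pd.read_fwf('file.csv', colspecs=[(0, 10), (11, 19), (20, 30)],
                 names=['year', 'hours', 'data'], header=None)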

JC_CL
    [`pd.read_fwf`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html) is better for you in this case. – Quang Hoang Aug 01 '22 at 08:07
  • Have you tried to give pandas more information on the formatting? `format="%m.%d.%Y %H:%M%S"` – Jacob Aug 01 '22 at 08:10
  • @QuangHoang: That seemed like the method I was searching for, but apparently, my files are even worse than I thought (`data` field has varying lengths). @Jacob: Shouldn't `dayfirst=True` be enough information? If I read in a "good" CSV, that always works without noticeable delay, so it seems like the slow part is in the merging. – JC_CL Aug 01 '22 at 09:23

1 Answer


Using pd.to_datetime() is the correct way to go. However, to merge the dates you also need to know whether the dates in your files match, or whether they need to be synchronized.

In case they match, you can use df.join() with the datetime as the key.

If they do not match, you can use pd.merge_asof(). This will match your datetime with the closest one it can find. You will lose some accuracy, but your data will be synchronized.
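A minimal sketch of what I mean (the frames and column names here are made up just to illustrate):

import pandas as pd

left = pd.DataFrame({'date': pd.to_datetime(['1968-03-21 07:01:00',
                                             '1968-03-25 07:01:00']),
                     'a': [661.8, 661.77]})
right = pd.DataFrame({'date': pd.to_datetime(['1968-03-21 07:01:02',
                                              '1968-03-25 07:00:58']),
                      'b': [1.0, 2.0]})

# both frames must be sorted on the key; each left row is then matched
# to the nearest right timestamp
merged = pd.merge_asof(left, right, on='date', direction='nearest')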

  • What do you mean by "matching"? I need to take each line in the raw file as is, i.e. the `DD.MM.YYYY` and the `hh:mm:ss` parts must ALWAYS be combined, exactly as they are read in. – JC_CL Aug 01 '22 at 09:25
  • If, in the first file, one line is e.g. 10:35:17 and this time does not exist in the other file, it will try to match it to the closest one. So 10:35:17 could be matched to 10:35:18. If you know that you have the same timestamps across your files, then you do not need to match your timeframes. – P.Grimanelis Aug 01 '22 at 09:42