-1

I'm trying to get the index of my dataFrame to be of type datetime. My CSV file contains seperate columns of Dates and Times which i combine upon importing:

df = pd.read_csv("example.csv", sep=";", decimal=",", parse_dates=[["Date", "Time"]])

It will look like this after the import:

Date_Time
0 1012020 00:00:00
1 1012020 00:15:00

The problem is the missing leading zero on the first 9 days of each month. Pandas to_datetime() needs a leading zero for the %d format option to work. When i use format="%d%m%Y%H:%M:%S" python says "invalid syntax"

How can I convert this column to datetime?

sushanth
  • 8,275
  • 3
  • 17
  • 28
KaaraT
  • 90
  • 6

1 Answers1

1

Use Series.str.zfill (as suggested by @FObersteiner in the comments) and apply pd.to_datetime afterwards:

import pandas as pd

# changing 2nd val to `'12012020 00:15:00'` to show that 
# only the 1st val is affected
data = {'Date_Time': {0: '1012020 00:00:00', 1: '12012020 00:15:00'}}

df = pd.DataFrame(data)

df['Date_Time'] = pd.to_datetime(df["Date_Time"].str.zfill(17), 
                                 format="%d%m%Y %H:%M:%S")

print(df)

            Date_Time
0 2020-01-01 00:00:00
1 2020-01-12 00:15:00

print(df['Date_Time'].dtype)
datetime64[ns]

Another (admittedly, unnecessarily complicated) way to go, would be to use a regex pattern to replace all "dates" with 7 digits by their 8-digit equivalent:

df['Date_Time'] = pd.to_datetime(
    df['Date_Time'].replace(r'^(\d{7}\s)',r'0\1', regex=True),
    format="%d%m%Y %H:%M:%S")

Explanation r'^(\d{7}\s)':

  • ^ assert position at start of the string
  • \d{7}\s matches 7 digits followed by a whitespace
  • The encapsulating brackets turn this into a Capturing Group

Explanation r'0\1':

  • \1 refers back to the Capturing Group (1st of 1 group(s)), to which we prepend 0
ouroboros1
  • 9,113
  • 3
  • 7
  • 26
  • Reads like you could achieve the same with str.zfill(8), no? But what if `1012020` means 10th of January? – FObersteiner Sep 04 '22 at 13:52
  • @FObersteiner: yes, hadn't thought of that. `df['Date_Time'].str.zfill(17)` should work (17 for the whole string). `1012020` will not mean 10th of January. As specified by the OP: only the day is missing the leading zero. So, 10th of January would be `10012020`. – ouroboros1 Sep 04 '22 at 14:14
  • I tried both but pandas seems to be the problem here. @ouroboros1 Trick worked but pandas removes the leading 0 again. The column is of type pandas.core.series.Series. I tried the solutions from this post https://stackoverflow.com/questions/13250046/how-to-keep-leading-zeros-in-a-column-when-reading-csv-with-pandas but it had no effect... – KaaraT Sep 05 '22 at 08:34
  • What exactly do you mean by "Trick worked but pandas removes the leading 0 again"? At what point is the "trick" working/at what stage does pandas seem to remove the 0 again? If you simply do `print(df['Date_Time'].str.zfill(17))`, does it show the strings as `01012020 00:00:00`, `01012020 00:15:00`, etc. with a leading 0, when required? And are you getting an actual error, or is the result just not what you expected. If the former, please share the error; if the latter, what exactly is the result? – ouroboros1 Sep 05 '22 at 08:59
  • @ouroboros1 Okay so i do `df["Date_Time"].str.zfill(17)` which will print out `0 01012020 00:00:00`. However if I do `df["Date_Time"]` in the next cell the output will be `0 1012020 00:00:00` again, without the leading zero. So it doesn't seem to get overwritten? – KaaraT Sep 06 '22 at 06:36
  • It's not an `inplace` operation. You need to overwrite the value of `df["Date_Time"]` for pandas to keep the change. So, use: `df["Date_Time"] = df["Date_Time"].str.zfill(17)`. But the full line would simply be: `df['Date_Time'] = pd.to_datetime(df["Date_Time"].str.zfill(17), format="%d%m%Y %H:%M:%S")` – ouroboros1 Sep 06 '22 at 06:39