15

I have a datatime data, their format is like 29062017 and 01AUG2017. As you can see, the month is in the middle of data.

I want to convert this data to datetime, when I use pd.to_datetime, but it doesn't work.

Do you know a good way to solve this problem?

cs95
  • 379,657
  • 97
  • 704
  • 746
yuchen huang
  • 257
  • 1
  • 2
  • 10
  • 1
    Does your column have mixed formats? Tif not, the first format is simple - `'%d%m%Y'` and the second is `'%d%b%Y'`. If yes, you'll need to do a little preprocessing before conversion. – cs95 Nov 13 '17 at 02:21

5 Answers5

26

You can use pd.to_datetime's format arg:

In [11]: s = pd.Series(["29062017", "01AUG2017"])

In [12]: pd.to_datetime(s, format="%d%m%Y", errors="coerce")
Out[12]:
0   2017-06-29
1          NaT
dtype: datetime64[ns]

In [13]: pd.to_datetime(s, format="%d%b%Y", errors="coerce")
Out[13]:
0          NaT
1   2017-08-01
dtype: datetime64[ns]

Note: the coerce argument means that failures will be NaT.

and fill in the NaNs from one into the other e.g. using fillna:

In [14]: pd.to_datetime(s, format="%d%m%Y", errors="coerce").fillna(
    ...:     pd.to_datetime(s, format="%d%b%Y", errors="coerce"))
Out[14]:
0   2017-06-29
1   2017-08-01
dtype: datetime64[ns]

Any strings that don't match either format will remain NaT.

wjandrea
  • 28,235
  • 9
  • 60
  • 81
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • This would've been one option. I was actually thinking of substitution followed by the conversion (as in my answer). – cs95 Nov 13 '17 at 02:31
  • @cᴏʟᴅsᴘᴇᴇᴅ my bet is this'll be faster, but I think both are good. – Andy Hayden Nov 13 '17 at 02:42
  • I thought so too, because regex based substitution is pretty slow in general. I did a benchmark on `s = pd.concat([s] * 100000)` and... interestingly... yours is `1.01s`, mine is `1s`. Neck and neck! Let me know if you see any different results. – cs95 Nov 13 '17 at 02:44
  • @cᴏʟᴅsᴘᴇᴇᴅ in my testing these seemed to be much of a muchness. I am a little disappointed at the speed of to_datetime, I had imagined it was cythonized but perhaps not? – Andy Hayden Nov 13 '17 at 04:26
  • 1
    @cᴏʟᴅsᴘᴇᴇᴅ having - long ago - poked around in the python datetime library which pandas was using (perhaps it's something better/different now)... I won't be doing again in a hurry - datetime handling is a world of pain. – Andy Hayden May 10 '18 at 22:34
  • This is a neat solution (+1)! But it obviously works only with two different formats, but not with more. Does anybody have an idea of how to solve it? I thought errors = 'ignore' might do the job, but it doesn't. – W. Walter Jan 11 '22 at 11:21
7

The alternative would be to use a mapper and replace to substitute month codes with their numerical equivalent:

s = pd.Series(["29062017", "01AUG2017"]); s

0     29062017
1    01AUG2017
dtype: object

m = {'JAN' : '01', ..., 'AUG' : '08', ...}  # you fill in the rest

s = s.replace(m, regex=True); s

0    29062017
1    01082017
dtype: object

Now all you need is a single pd.to_datetime call:

pd.to_datetime(s, format="%d%m%Y", errors="coerce")

0   2017-06-29
1   2017-08-01
dtype: datetime64[ns]
cs95
  • 379,657
  • 97
  • 704
  • 746
5

Since you have two type of datetime ...

s.apply(lambda x : pd.to_datetime(x, format="%d%m%Y") if x.isdigit() else pd.to_datetime(x, format="%d%b%Y"))

Out[360]: 
0   2017-06-29
1   2017-08-01
dtype: datetime64[ns]
BENY
  • 317,841
  • 20
  • 164
  • 234
4

I wanted to weigh in with some options

Setup

m = dict(
    JAN='01', FEB='02', MAR='03', APR='04',
    MAY='05', JUN='06', JUL='07', AUG='08',
    SEP='09', OCT='10', NOV='11', DEC='12'
)

m2 = m.copy()
m2.update({v: v for v in m.values()})

f = lambda x: m.get(x, x)

Option 1
list comprehension

pd.Series(
    pd.to_datetime(
        [x[:2] + f(x[2:5]) + x[5:] for x in s.values.tolist()],
        format='%d%m%Y'),
    s.index)

0   2017-06-29
1   2017-08-01
dtype: datetime64[ns]

Option 2
Create a dataframe

pd.to_datetime(
    pd.DataFrame(dict(
        day=s.str[:2],
        year=s.str[-4:],
        month=s.str[2:-4].map(m2)
    )))

0   2017-06-29
1   2017-08-01
dtype: datetime64[ns]

Option 2B
Create a dataframe

pd.to_datetime(
    pd.DataFrame(dict(
        day=s.str[:2],
        year=s.str[-4:],
        month=s.str[2:-4].map(f)
    )))

0   2017-06-29
1   2017-08-01
dtype: datetime64[ns]

Option 2C
Create a dataframe
I estimate this to be the quickest

pd.to_datetime(
    pd.DataFrame(dict(
        day=s.str[:2].astype(int),
        year=s.str[-4:].astype(int),
        month=s.str[2:-4].map(m2).astype(int)
    )))

0   2017-06-29
1   2017-08-01
dtype: datetime64[ns]

Test

s = pd.Series(["29062017", "01AUG2017"] * 100000)

%timeit pd.to_datetime(s.replace(m, regex=True), format='%d%m%Y')
%timeit pd.to_datetime(s.str[:2] + s.str[2:5].replace(m) + s.str[5:], format='%d%m%Y')
%timeit pd.to_datetime(s.str[:2] + s.str[2:5].map(f) + s.str[5:], format='%d%m%Y')
%timeit pd.to_datetime(s, format='%d%m%Y', errors='coerce').fillna(pd.to_datetime(s, format='%d%b%Y', errors='coerce'))
%timeit pd.Series(pd.to_datetime([x[:2] + f(x[2:5]) + x[5:] for x in s.values.tolist()], format='%d%m%Y'), s.index)
%timeit pd.to_datetime(pd.DataFrame(dict(day=s.str[:2], year=s.str[-4:], month=s.str[2:-4].map(m2))))
%timeit pd.to_datetime(pd.DataFrame(dict(day=s.str[:2], year=s.str[-4:], month=s.str[2:-4].map(f))))
%timeit pd.to_datetime(pd.DataFrame(dict(day=s.str[:2].astype(int), year=s.str[-4:].astype(int), month=s.str[2:-4].map(m2).astype(int))))

1.39 s ± 24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
690 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
613 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
533 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
529 ms ± 8.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
557 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
607 ms ± 26.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
328 ms ± 31.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • I'd not seen that pd.to_datetime could take a DataFrame, I guess that has been cythonized and the datetime parser is (sadly) still python? I am surprised it's fastest! – Andy Hayden Nov 13 '17 at 04:28
3

Here is my solution for this problem:

def set_date(col):
    # date_formates = ["21 June, 2018", "12/11/2018 09:15:32", "April-21" ]
    date_formats = ["%d %B, %Y", "%d/%m/%Y %H:%M:%S", "%B-%y", "%d %B, %Y", "%m/%d/Y"] # Can add different date formats to this list to test
    for x in date_formats:
        col = pd.to_datetime(col, errors="ignore", format= f"{x}")

    col = pd.to_datetime(col, errors="coerce") # To remove errors in the columns like strings or numbers
    return col
Flair
  • 2,609
  • 1
  • 29
  • 41
  • I think this solution helps for any generalization to the issue, did you tried also to `%timeit` compared with some of the other methods ? – Andrea Ciufo Feb 15 '23 at 18:23