6

I'm trying to extract dates in any format from a pandas column (the date is a part of a longer string).

I have found this answer which does it outside of pandas, but I'm not sure how to use that to do it in a pandas column.

The dates can be in various formats, for example:

footballer, born October 1989
footballer, born 1900s
footballer, born 29 December 1987
Brazilian footballer, born 1983
31/02/1901
16 May 2019

Is it possible to have any date format and parts of dates as a pandas column with a date type?

Superdooperhero
  • 7,584
  • 19
  • 83
  • 138

2 Answers2

6

using the approach from the linked answer:

import dateutil.parser as dparser
s.apply(lambda x: dparser.parse(x,fuzzy=True).strftime('%Y-%m-%d'))

Although dparser can't of course cope with all possibilities: in the sample data you'll have to change footballer, born 1900s to footballer, born 1900's, otherwise parse will complain that second must be in 0..59

If you need exception handling, you'll have to define a regular function as lambdas can't handle try/except:

def myparser(x):
    try:
       return dparser.parse(x,fuzzy=True)
    except:
       return None

s.apply(lambda x: myparser(x))

This will insert NaT values for wrong dates (or you can provide a 'default date' if you like):

0    1989-10-12
1           NaT
2    1987-12-29
3    1983-07-12
4           NaT
5    2019-05-16
Stef
  • 28,728
  • 2
  • 24
  • 52
  • it appears `pd.to_datetime` does not have any option for `infer_datetime_format` to be applied per instance - so this may also be your fastest obvious option. – modesitt Jul 12 '19 at 21:08
  • @modesitt and infer_datetime_format won't handle cases when other text (footballer etc) is in the string – Stef Jul 12 '19 at 21:20
  • Is this for a specific column? `df['birth_date'] = df.apply(lambda x: dparser.parse(x, fuzzy=True).strftime('%Y-%m-%d'))` gives me a `TypeError: ('Parser must be a string or character stream, not Series', 'occurred at index subject')` error – Superdooperhero Jul 12 '19 at 22:20
  • 2
    @Superdooperhero You try to apply the function to the whole dataframe, apply it to a specific column: if you have a dataframe with a column `Strings` then use this to insert a new column `birth_date`: `df['birth_date']=df.Strings.apply(lambda x: dparser.parse(x,fuzzy=True))`. – Stef Jul 13 '19 at 05:42
  • @Stef: Thanks that fixed the previous error, but now I get this error: `ValueError: ('String does not contain a date:', 'footballer')`, sorry I forgot to mention that some rows does not contain a date but instead just a random string without a date. – Superdooperhero Jul 13 '19 at 05:54
  • 1
    @Superdooperhero invalid or unknown string formats without a date will raise this ValueError, see my edit above for error handling. – Stef Jul 13 '19 at 05:57
0

Try this, if it can't recognize a row as containing a date it will return 1/1/1 , if the date is not complete with month and date will assume january 1st, but you can change it, by adjust the default.

import pandas as pd
import numpy as np
from datetime import datetime
from dateutil.parser import parse

l = ['footballer, born October 1989',
'footballer, born 1900s',
'footballer, born 29 December 1987',
'Brazilian footballer, born 1983',
'31/02/1901',
'16 May 2019']

df  = pd.Series(l, name='strings')

def get_dates(series):
my_list =[]
for i in range(len(series)):
    for j in range(len(series[i])):
        try:
            my_list.append(parse(series[i][j:],default=datetime(1, 1, 1)).strftime('%Y-%m-%d'))
            break                
        except:
            pass
return pd.Series(my_list)    


get_dates(df)

0    1989-10-01
1    0001-01-01
2    1987-12-29
3    1983-01-01
4    1901-01-02
5    2019-05-16
dtype: object