So I have this data set:

1.0                 20/20/1999
2.0                    31/2014
3.0                       2015
4.0        2008-01-01 00:00:00
5.0        1903-10-31 00:00:00
6.0        1900-01-20 00:00:00
7.0        2011-02-21 00:00:00
8.0        1999-10-11 00:00:00

These dates were imported from Excel. Because the dataset is large and comes from multiple sources, it can contain any number of yyyy-mm-dd permutations, with -, / or nothing as the separator, and with the month or day missing. It's a nightmare.

I want to keep the values that are in a valid format; anything not recognized as a valid date should come back as just a year, or as nothing.

This is how far I've got:

  1. I import the data as-is from Excel
  2. df['date_col'].date_format('%Y-%m-%d')

I found a regex that is supposed to match a year-only field, but I'm stuck on what to apply it to: ^[0-9]{2,2}$
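For reference, a quick check of what that pattern accepts, next to a four-digit variant. The four-digit pattern is an assumption about what a year-only cell looks like; it is not part of the original attempt.

```python
import re

two_digits = re.compile(r"^[0-9]{2,2}$")  # the pattern quoted above
four_digits = re.compile(r"^[0-9]{4}$")   # assumed shape of a year-only cell

# Print, for each sample cell, whether each pattern matches the whole string.
for cell in ["2015", "15", "31/2014"]:
    print(cell, bool(two_digits.match(cell)), bool(four_digits.match(cell)))
```

Both patterns are anchored with ^ and $, so neither can pick a year out of a longer string such as 31/2014.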

I have tried dateutil without success. It refuses to parse the examples that only have a month.

MarkoBox
  • Possible duplicate of [How to parse an ISO 8601-formatted date?](https://stackoverflow.com/questions/127803/how-to-parse-an-iso-8601-formatted-date) – wnnmaw Jan 05 '18 at 16:32
  • Flagging as a duplicate to a question about a specific format, as the answer to that question notes that `dateutil-parser` will attempt to guess at a format. I tested it for your cases and it works for all but the first two, so you'll still have some work to do, but `20/20/1999` isn't a valid date anyways – wnnmaw Jan 05 '18 at 16:34

1 Answer

I'm not familiar with a DataFrame or Series method called date_format, and your regex doesn't seem to return the year for me. That aside, I would suggest defining a function that can handle any of these formats and mapping it over the date column. Like so:

df
                  date
0           20/20/1999
1              31/2014
2                 2015
3  2008-01-01 00:00:00
4  1903-10-31 00:00:00
5  1900-01-20 00:00:00
6  2011-02-21 00:00:00
7  1999-10-11 00:00:00

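import re
import pandas as pd

def convert_dates(x):
    x = str(x)  # Excel cells may arrive as ints or datetimes, not only strings
    try:
        out = pd.to_datetime(x)
    except ValueError:
        # Drop every leading one- or two-digit field ("20/20/", "31/") so only the year is left
        x = re.sub(r'^([0-9]{1,2}/)+', '', x)
        out = pd.to_datetime(x)
    return out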

df.date.map(convert_dates)
0   1999-01-01
1   2014-01-01
2   2015-01-01
3   2008-01-01
4   1903-10-31
5   1900-01-20
6   2011-02-21
7   1999-10-11
Name: date, dtype: datetime64[ns]

Granted, this function doesn't handle strings that don't contain a year, but your sample doesn't include an example of that.
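If you also want the "year or nothing" behaviour from the question, here is a minimal sketch of one way to get it. It assumes that a cell pandas cannot parse either ends in a four-digit year or should become NaT. The name parse_or_year is hypothetical, and the trailing-year regex is my assumption about your data, not something your sample confirms.

```python
import pandas as pd

def parse_or_year(values: pd.Series) -> pd.Series:
    """Parse what pandas recognises; otherwise fall back to a trailing year, else NaT."""
    # Normalise everything to text so ints and datetimes from Excel are handled alike.
    text = values.astype(str).str.strip()

    # First pass: parse cell by cell, turning failures into NaT instead of raising.
    parsed = pd.Series(
        [pd.to_datetime(s, errors="coerce") for s in text],
        index=values.index,
    )

    # Fallback: pull out a four-digit number at the end of the string (assumed to be the year).
    year = text.str.extract(r"(?<!\d)(\d{4})\s*$")[0]
    fallback = pd.to_datetime(year, format="%Y", errors="coerce")

    # Keep the parsed value where there is one, otherwise the year-only date (or NaT).
    return parsed.fillna(fallback)

sample = pd.Series(["20/20/1999", "31/2014", "2015", "2008-01-01 00:00:00", "no date"])
result = parse_or_year(sample)
print(result)
```

With this sample, the first two cells fall back to 1 January of their year, the next two parse directly, and the last one comes back as NaT. pandas may emit a warning about not being able to infer a format for the cells it cannot parse; that is expected here.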

Grr
  • I'm getting TypeError: Unrecognized value type: when trying to map function. But thanks for the hint ill try to work something out. – MarkoBox Jan 08 '18 at 09:05