0

I have an array of date strings, eg: ["1999-2-4", "1989-2", "2020", "1914/09/01"]

I'm converting these strings to timestamps with pandas' to_datetime.

But I get back a standard pandas datetime with ns precision. I need some way of also knowing what the original precision of the string was (ie [day, month, year, day] for the array above)

What I initially tried was setting up an array of formats matched with an array of precisions:

1: ["%Y-%m-%d", "%Y/%m/%d", "%Y-%m", "%Y"]

2: ["day", "day", "month", "year"]

and I planned on simply trying each format in order until one worked, and then taking the matching precision.

However, unfortunately (for my purposes), an input like "1999" passed to to_datetime with format="%Y-%m-%d", even with exact=True, will successfully parse. So there went the plan of relying on try-catching in a loop.

I need some way of getting the original precision. Is this possible with pandas? Alternatively, is this possible with dateutil?

dWitty

3 Answers

2

One core question that arises: how do you plan to use the precision information later on?

In your case (also given the differences in formatting, with optional leading zeros on days and months), I would go with an approach that first extracts the individual date components (year, month, day) and then combines them.


import pandas as pd

def parse_date(s):
    # Split the raw string into components; missing ones become None.
    date_entries = s["date"].split("-")
    s["year"] = int(date_entries[0]) if len(date_entries) > 0 else None
    s["month"] = int(date_entries[1]) if len(date_entries) > 1 else None
    s["day"] = int(date_entries[2]) if len(date_entries) > 2 else None
    return s

dates = ["1999-2-4", "1989-2", "2020", "1914-09-01"]
# Apply row by row so each date string gets its own year/month/day columns.
pd.DataFrame(dates, columns=["date"]).apply(parse_date, axis=1)

Output:

      date      year    month   day
0   1999-2-4    1999    2.0     4.0
1   1989-2      1989    2.0     NaN
2   2020        2020    NaN     NaN
3   1914-09-01  1914    9.0     1.0

Note that month and day come back as floats (np.float64), because those columns contain missing values. You can add a concrete precision calculation to the parse_date function and combine the components in a new column as needed.
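As a minimal sketch of such a precision calculation: the number of components found maps directly to a label. The helper name date_precision, the PRECISIONS table and the separator set [-/.] are assumptions for illustration, not part of the code above.

```python
import re

# Hypothetical lookup: number of date components -> precision label.
PRECISIONS = {1: "year", 2: "month", 3: "day"}

def date_precision(text):
    # Split on the assumed separators and ignore empty pieces.
    parts = [p for p in re.split(r"[-/.]", text.strip()) if p]
    # Returns None for shapes this sketch does not know about.
    return PRECISIONS.get(len(parts))

dates = ["1999-2-4", "1989-2", "2020", "1914/09/01"]
precisions = [date_precision(d) for d in dates]
print(precisions)
```

This only counts components; it does not check that the values form a valid date.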

Alternatively, you can also use .str.extract providing a regular expression:

df = pd.DataFrame(dates, columns=["date"])
df["date"].str.extract("(?P<year>[0-9]{4})-?(?P<month>[01]?[0-9])?-?(?P<day>[0-3]?[0-9])?")

Output:

    year    month   day
0   1999    2        4
1   1989    2       NaN
2   2020    NaN     NaN
3   1914    09       01
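The extracted columns can also yield the precision directly by counting how many groups matched. The following is a sketch, assuming the separators are limited to -, / and . (the pattern is a stricter variant of the one above, and the precision column name is made up for this example):

```python
import pandas as pd

dates = ["1999-2-4", "1989-2", "2020", "1914/09/01"]
df = pd.DataFrame(dates, columns=["date"])

# Assumed separator set: "-", "/" or ".". Month and day are optional groups.
pattern = (
    r"^(?P<year>\d{4})"
    r"(?:[-/.](?P<month>\d{1,2}))?"
    r"(?:[-/.](?P<day>\d{1,2}))?$"
)
parts = df["date"].str.extract(pattern)

# Count the matched groups per row and translate the count into a label.
matched = parts.notna().sum(axis=1)
df["precision"] = matched.map({1: "year", 2: "month", 3: "day"})
print(df)
```

Rows that do not match the pattern at all end up with a missing precision, which makes unexpected formats easy to spot.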
sim
  • thanks. unfortunately, the actual date formats are not uniformly separated by dashes-- could be /, could be other things. any parsable format. (I've edited the question to make this more clear) I do not need to do anything with the precision data other than store it, for now. (humans will read the output and need the precision information stored there) – dWitty Jul 29 '20 at 12:05
  • @dWitty: I would try to figure out the existing `datetime` `format`s upfront and adjust the regular expression (or the function to apply as per first example) accordingly. Especially when you are not certain about your input format, try to code defensively (e.g. if you are not sure whether a date is "month"- or "day"-major, is it sensible to simply assume the correct format?). There's no "general" best solution here - perhaps you could infer information on the format based on other values? – sim Jul 29 '20 at 16:38
0

From my point of view, this is not the best approach. Try/catch should not be used for control flow when you can avoid it. Why not pick the exact format based on the input? Something like:

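import pandas as pd

def get_format(text):
    # Check "/" first: a slash-separated date contains no "-" and would
    # otherwise be mistaken for a bare year.
    if text.count('/') == 2:
        return "%Y/%m/%d"
    if text.count('-') == 2:
        return "%Y-%m-%d"
    if text.count('-') == 1:
        return "%Y-%m"
    return "%Y"


dates = ["1999-2-4", "1989-2", "2020", "1914-09-01"]

# Parse each string with the format chosen for it.
results = [pd.to_datetime(x, format=get_format(x)) for x in dates]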

Or if you could have possibly more formats try to read this
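Since the format is chosen up front, the precision can be stored next to it and returned together with the parsed value. This is a sketch only: the FORMATS table and parse_with_precision are hypothetical names, and it uses the standard library's datetime.strptime rather than pandas, because strptime rejects input that does not match the format exactly.

```python
from datetime import datetime

# Hypothetical table: (separator, separator count) -> (format, precision).
FORMATS = {
    (None, 0): ("%Y", "year"),
    ("-", 1): ("%Y-%m", "month"),
    ("-", 2): ("%Y-%m-%d", "day"),
    ("/", 1): ("%Y/%m", "month"),
    ("/", 2): ("%Y/%m/%d", "day"),
}

def parse_with_precision(text):
    # Find which assumed separator the string uses, if any.
    for sep in ("-", "/"):
        count = text.count(sep)
        if count:
            # Unsupported shapes (e.g. three separators) raise KeyError.
            fmt, precision = FORMATS[(sep, count)]
            return datetime.strptime(text, fmt), precision
    fmt, precision = FORMATS[(None, 0)]
    return datetime.strptime(text, fmt), precision

dates = ["1999-2-4", "1989-2", "2020", "1914/09/01"]
parsed = [parse_with_precision(d) for d in dates]
```

Missing components default to 1 in the resulting datetime, so the precision label is what records that "2020" was only a year.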

y0j0
0

Check out this code. You can add any separator you like (-, /, etc.) to the re.split pattern in the get_dict() function:

import pandas as pd
import re

def get_dict(dates):
    dic_list=[]
    for d in dates:
        dic={}
        list_ = re.split('-|/', d)
        dic['date']= d
        dic['Year'] = (list_[0]) if len(list_) > 0 else None
        dic['Month'] = (list_[1]) if len(list_) > 1 else None
        dic['Day'] = (list_[2]) if len(list_) > 2 else None
        dic_list.append(dic)
    return dic_list
dates = ["1999-2-4", "1989-2", "2020", "1914/09/01"]
dic_list = get_dict(dates)
df = pd.DataFrame(dic_list)
df

Output:

    date        Year    Month   Day
0   1999-2-4    1999    2        4
1   1989-2      1989    2       None
2   2020        2020    None    None
3   1914/09/01  1914    09      01

You can then select the columns you need with iloc, for example dropping the original date column:

df.iloc[:, 1:]

Output:

    Year    Month   Day
0   1999    2        4
1   1989    2       None
2   2020    None    None
3   1914    09       01
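To get back to what the question asks for, the split components can be turned into a real timestamp while the precision is kept in its own column. A sketch, assuming missing months and days should default to 1; the lowercase column names and the precision and timestamp columns are choices made for this example (pd.to_datetime accepts a DataFrame with year, month and day columns):

```python
import re
import pandas as pd

dates = ["1999-2-4", "1989-2", "2020", "1914/09/01"]
labels = ("year", "month", "day")

rows = []
for d in dates:
    parts = [int(p) for p in re.split(r"[-/]", d)]
    # The number of components present determines the precision label.
    precision = labels[len(parts) - 1]
    # Assumption: pad missing month/day with 1 so a timestamp can be built.
    parts += [1] * (3 - len(parts))
    rows.append({"date": d, "year": parts[0], "month": parts[1],
                 "day": parts[2], "precision": precision})

df = pd.DataFrame(rows)
# Assemble timestamps from the numeric component columns.
df["timestamp"] = pd.to_datetime(df[["year", "month", "day"]])
print(df[["date", "timestamp", "precision"]])
```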