I have a huge dataframe with many columns, many of which hold datetime.datetime values. The problem is that many also have mixed types, for instance datetime.datetime values alongside None (and potentially other invalid values):

0         2017-07-06 00:00:00
1         2018-02-27 21:30:05
2         2017-04-12 00:00:00
3         2017-05-21 22:05:00
4         2018-01-22 00:00:00
                 ...         
352867    2019-10-04 00:00:00
352868                   None
352869            some_string
Name: colx, Length: 352872, dtype: object

Hence the column ends up with object dtype. This can be solved with df.colx.fillna(pd.NaT); the problem is that the dataframe has too many columns to go through them one by one.

Another approach is to use pd.to_datetime(col, errors='coerce'), however this will cast to datetime many columns that contain numerical values.
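For instance (a quick illustration of the numeric issue: with the default unit='ns', integers are read as nanoseconds since the epoch instead of being rejected):

import pandas as pd

# A purely numeric column is silently 'parsed' as epoch timestamps:
pd.to_datetime(pd.Series([1, 2, 3]), errors='coerce')
# 0   1970-01-01 00:00:00.000000001
# 1   1970-01-01 00:00:00.000000002
# 2   1970-01-01 00:00:00.000000003
# dtype: datetime64[ns]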

I could also do df.fillna(float('nan'), inplace=True), though the columns containing dates are still of object type, and would still have the same problem.

What approach could I follow to cast to datetime those columns whose values really are datetimes, but which may also contain None and potentially some invalid values (I mention this because otherwise a pd.to_datetime in a try/except clause would do)? Something like a flexible version of pd.to_datetime(col).

yatu
  • Are the objects stored in the DataFrame of type `datetime.datetime` or `pandas._libs.tslibs.timestamps.Timestamp`? If the former, my recommendation would be to change whatever created them to the type that `pandas` handles a bit better. – ALollz Oct 28 '19 at 15:35
  • Are the `None` in your columns, actual `None` or string representatives of it? – Erfan Oct 28 '19 at 15:36
  • They are `None`, not strings. Potentially there can be wrong values too... @erfan – yatu Oct 28 '19 at 15:36
  • I am creating the dataframe using `read_sql`, and this is directly what I get. Some cols are not correctly parsed apparently (they have `object` dtype), and some are, so I am having a hard time parsing those that indeed should be datetime columns @ALollz – yatu Oct 28 '19 at 15:37
  • Then I wonder: how is the SQL model in your database? SQL enforces certain column types, so how did you end up with mixed-type columns? Can you maybe also show a column which has `datetime` and `values` in it? – Erfan Oct 28 '19 at 15:45
  • Assuming you needed these values to work with, at that point you would know which columns are needed right? Wouldn't it be feasible to cast what you need prior to working with the data? – r.ook Oct 28 '19 at 15:52
  • My suspicion is that they are not correctly parsed because they contain other values, not just `None` or `np.nan`. In which case I'd still be interested in somehow detecting these columns and parsing them @Erfan, perhaps with some other datetime parsing lib? Not sure... – yatu Oct 28 '19 at 15:55
  • No @r.ook I have about 600 columns, and perhaps 100 are datetime? And roughly half of those are not correctly parsed – yatu Oct 28 '19 at 15:56
  • Yes that is a good point @erfan , unfortunately I was not involved in setting up the `db`, and had it kind of given to me as is – yatu Oct 28 '19 at 16:06
  • Use the dateutil parser to guess the datetime. Maybe set a threshold of several dates (say 5) per column to be sure: https://stackoverflow.com/questions/9507648/datetime-from-string-in-python-best-guessing-string-format – Serge Nov 07 '19 at 17:50
  • Do the datetime columns have some repeating pattern? Like date_1, date_2, date_xyz ... ? @yatu – Vishnudev Krishnadas Nov 09 '19 at 04:33
  • No, potentially many formats @vishnu – yatu Nov 09 '19 at 09:51

2 Answers

5

This function will set the data type of a column to datetime if any value in the column matches the regex pattern (\d{4}-\d{2}-\d{2})+ (e.g. 2019-01-01). Credit to this answer on how to Search for String in all Pandas DataFrame columns and filter, which helped with setting and applying the mask.

import pandas as pd


def presume_date(dataframe):
    """ Set datetime by presuming any date values in the column
        indicates that the column data type should be datetime.

    Args:
        dataframe: Pandas dataframe.

    Returns:
        Pandas dataframe.

    Raises:
        None
    """
    df = dataframe.copy()
    # Flag columns where at least one value looks like an ISO date (YYYY-MM-DD)
    mask = dataframe.astype(str).apply(lambda x: x.str.match(
        r'(\d{4}-\d{2}-\d{2})+').any())
    df_dates = df.loc[:, mask].apply(pd.to_datetime, errors='coerce')
    for col in df_dates.columns:
        df[col] = df_dates[col]
    return df
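
For example, a quick sketch on a small mixed frame (made-up data, just to show the behavior):

df = pd.DataFrame({
    'colx': ['2017-07-06 00:00:00', None, 'some_string'],
    'coly': [1, 2, 3],
})

out = presume_date(df)
# 'colx' matches the regex, so it is coerced to datetime64[ns]
# ('some_string' becomes NaT); the numeric 'coly' is left untouched.
print(out.dtypes)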

Working from the suggestion to use dateutil, this may help. It still works on the presumption that if there are any date-like values in a column, then the column should be a datetime. I tried to consider the faster dataframe iteration methods; I think this answer on How to iterate over rows in a DataFrame in Pandas did a good job describing them.

Note that dateutil.parser will fill in the current day or year for strings like 'December' or 'November 2019' that have no day or year value.
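
For instance, a quick check of that behavior (the values it fills in depend on when you run it; the results shown assume a run date of 2019-11-09):

from dateutil.parser import parse

parse('December', fuzzy=True)       # datetime.datetime(2019, 12, 9, 0, 0)
parse('November 2019', fuzzy=True)  # datetime.datetime(2019, 11, 9, 0, 0)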

import pandas as pd
import datetime
from dateutil.parser import parse

# DataFrame.append was removed in pandas 2.0, so build from a list of dicts
rows = [
    {'are_you_a_date': 'December 2015', 'no_dates_here': 'just a string'},
    {'are_you_a_date': 'February 27 2018', 'no_dates_here': 'just a string'},
    {'are_you_a_date': 'May 2017 12', 'no_dates_here': 'just a string'},
    {'are_you_a_date': '2017-05-21', 'no_dates_here': 'just a string'},
    {'are_you_a_date': None, 'no_dates_here': 'just a string'},
    {'are_you_a_date': 'some_string', 'no_dates_here': 'just a string'},
    {'are_you_a_date': 'Processed: 2019/01/25', 'no_dates_here': 'just a string'},
    {'are_you_a_date': 'December', 'no_dates_here': 'just a string'},
]
df = pd.DataFrame(rows, columns=['are_you_a_date', 'no_dates_here'])


def parse_dates(x):
    # Return a parsed datetime when possible, else an empty string
    try:
        return parse(x, fuzzy=True)
    except (ValueError, TypeError):
        return ''


list_of_datetime_columns = []
for col in df.columns:
    # A column qualifies as datetime if any of its values parses as a date
    if any(isinstance(parse_dates(val), datetime.datetime)
           for val in df[col]):
        list_of_datetime_columns.append(col)

df_dates = df.loc[:, list_of_datetime_columns].apply(pd.to_datetime, errors='coerce')

for col in list_of_datetime_columns:
    df[col] = df_dates[col]
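
You can then sanity-check which columns were detected and how they were converted:

print(list_of_datetime_columns)
# ['are_you_a_date']
print(df.dtypes)
# are_you_a_date    datetime64[ns]
# no_dates_here             object
# dtype: object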

In case you would rather keep the datetime values produced by dateutil.parser (which, with fuzzy=True, also handles strings like 'Processed: 2019/01/25'), you can run this on the original string columns instead of the pd.to_datetime step above:

for col in list_of_datetime_columns:
    df[col] = df[col].apply(parse_dates)
Rick
  • This is a nice idea, but unfortunately I'm looking for something that can generalize to potentially several different datetime formats, so without hardcoding the format. Appreciate the effort though – yatu Nov 07 '19 at 17:22
  • @yatu Not a problem - I just happened to be working on something that needed this. I wonder if you can generalize to all datetime formats though? You may have to just account ahead of time for all of the formats that you would expect to see; or, all of the formats you would consider to be valid datetime. – Rick Nov 07 '19 at 18:49
  • @yatu Actually that `dateutil` module mentioned by @Serge looks like it could be useful. – Rick Nov 07 '19 at 19:12
  • @yatu please see my updated answer. I used `dateutil.parse` to identify many different kinds of date strings. – Rick Nov 09 '19 at 01:39
  • Looks good! Don't have much time now, will take a look as soon as I can @yes – yatu Nov 09 '19 at 09:51
  • Thanks @yes this helped! Nice to see how to use `dateutil.parse` here – yatu Nov 15 '19 at 09:13
2

The main problem I see is when parsing numerical values.

I'd propose converting them to strings first.


Setup

import pandas as pd

dat = {
    'index': [0, 1, 2, 3, 4, 352867, 352868, 352869],
    'columns': ['Mixed', 'Numeric Values', 'Strings'],
    'data': [
        ['2017-07-06 00:00:00', 1, 'HI'],
        ['2018-02-27 21:30:05', 1, 'HI'],
        ['2017-04-12 00:00:00', 1, 'HI'],
        ['2017-05-21 22:05:00', 1, 'HI'],
        ['2018-01-22 00:00:00', 1, 'HI'],
        ['2019-10-04 00:00:00', 1, 'HI'],
        ['None', 1, 'HI'],
        ['some_string', 1, 'HI']
    ]
}

df = pd.DataFrame(**dat)

df

                      Mixed  Numeric Values Strings
0       2017-07-06 00:00:00               1      HI
1       2018-02-27 21:30:05               1      HI
2       2017-04-12 00:00:00               1      HI
3       2017-05-21 22:05:00               1      HI
4       2018-01-22 00:00:00               1      HI
352867  2019-10-04 00:00:00               1      HI
352868                 None               1      HI
352869          some_string               1      HI

Solution

df.astype(str).apply(pd.to_datetime, errors='coerce')

                     Mixed Numeric Values Strings
0      2017-07-06 00:00:00            NaT     NaT
1      2018-02-27 21:30:05            NaT     NaT
2      2017-04-12 00:00:00            NaT     NaT
3      2017-05-21 22:05:00            NaT     NaT
4      2018-01-22 00:00:00            NaT     NaT
352867 2019-10-04 00:00:00            NaT     NaT
352868                 NaT            NaT     NaT
352869                 NaT            NaT     NaT
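
From there, one way to finish the job is to assign back only the columns where at least one value actually parsed (a sketch; treating a single surviving date as proof of a datetime column is an assumption, so tighten the threshold if your data needs more evidence):

parsed = df.astype(str).apply(pd.to_datetime, errors='coerce')

# Purely numeric or plain-string columns coerce to all-NaT and are skipped
datetime_cols = parsed.columns[parsed.notna().any()]
df[datetime_cols] = parsed[datetime_cols]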
piRSquared
  • Well it looks like this just hugely simplifies the problem. I didn't even think of this. The ideal scenario was to simply apply `pd.to_datetime` and `coerce` the errors, since there are many. The problem was with the numerical columns. But it didn't occur to me that numerical columns cast to string are not parsed by pandas' `to_datetime`. Thanks so much, this really helps! – yatu Nov 13 '19 at 19:57