
I wrote a function that converts all the dates of the specified date column to the specified format. Any missing or invalid date is replaced with a value specified by the user.

The code also takes "serial dates" like "5679" into consideration. But my code isn't working on the serial numbers. Let me know where the issue is.

My code:

import pandas as pd
import math

def date_fun(df, date_inputs):
    col_name = date_inputs["DateColumn"]
    date_format = date_inputs["DateFormat"]
    replace_value = date_inputs.get("ReplaceDate", None)
    
    # Convert column values to string
    df[col_name] = df[col_name].astype(str)
    
    # Check if the column contains serial dates
    if df[col_name].str.isnumeric().all():
        # Convert the column to integer
        df[col_name] = pd.to_numeric(df[col_name], errors='coerce')
        
        # Check if the values are within the valid range of serial dates in Excel
        if df[col_name].between(1, 2958465).all():
            df[col_name] = pd.to_datetime(df[col_name], unit='D', errors='coerce')
        else:
            if replace_value is not None:
                df[col_name] = replace_value
            else:
                df[col_name] = "Invalid Date"
    else:
        df[col_name] = pd.to_datetime(df[col_name], errors='coerce')
    
    # Convert the datetime values to the specified format
    df[col_name] = df[col_name].dt.strftime(date_format)
    
    # Replace invalid or null dates with the specified value (if any)
    if replace_value is not None:
        replace_value = str(replace_value) # convert to string
        df[col_name] = df[col_name].fillna(replace_value)
    
    new_data = df[col_name].to_dict()
    
    # Handle NaN and infinity values
    def handle_nan_inf(val):
        if isinstance(val, float) and (math.isnan(val) or math.isinf(val)):
            return str(val)
        else:
            return val
    
    new_data = {k: handle_nan_inf(v) for k, v in new_data.items()}
    
    return new_data

Example:

Input: 45678
Expected Output: 2024-06-27  
Current Output: NaN 

  

Input

25.09.2019
9/16/2015
10.12.2017
02.12.2014
08-Mar-18
08-12-2016
26.04.2016
05-03-2016
24.12.2016
10-Aug-19
abc
05-06-2015
12-2012-18
24-02-2010
2008,13,02
16-09-2015
23-01-1992, 7:45

2nd December 2018
45678

My output

            "2019/09/25",
            "2015/09/16",
            "2017/10/12",
            "2014/02/12",
            "2018/03/08",
            "2016/08/12",
            "2016/04/26",
            "2016/05/03",
            "2016/12/24",
            "2019/08/10",
            "nan",
            "2015/05/06",
            "nan",
            "2010/02/24",
            "2008/02/01",
            "2015/09/16",
            "1992/01/23",
            "nan",
            "2018/12/02",
            "nan"

Date Format specified: "%Y/%m/%d"
Apoorva
  • Can you provide the input and output of an example? https://stackoverflow.com/help/minimal-reproducible-example – Panda Kim Apr 26 '23 at 10:03
  • Input: 45678, Expected Output: 2024-06-27, Current Output: NaN – Apoorva Apr 26 '23 at 10:07
  • https://stackoverflow.com/questions/63963635/how-to-convert-a-column-with-excel-serial-dates-and-regular-dates-to-a-pandas-da – gerald Apr 26 '23 at 11:55
  • For Excel serial dates specifically, see [Convert Excel style date with pandas](https://stackoverflow.com/a/65460255/10197418) – FObersteiner Apr 26 '23 at 11:56
  • "45678" is the serial date. It's at the end of the list. – Apoorva Apr 26 '23 at 11:59
  • @Apoorva, sorry, overlooked that! Besides, how do you intend to distinguish between 'troublemakers' such as "01-02-2022" meaning Jan 2nd and "01-02-2022" meaning Feb 1st? – FObersteiner Apr 26 '23 at 12:01
  • @FObersteiner, according to my observation, when both the month and day numbers are <= 12, the lower number goes into the month and the higher into the day, but in a handful of cases the numbers are left in whatever order they're in. I couldn't find a source on how pandas reads ambiguous dates. The above is based on the dates that I've posted as inputs in my question. – Apoorva Apr 27 '23 at 19:59
  • See [pd.to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html), keyword 'dayfirst'. That defaults to False, so whenever both numbers are <= 12, pandas assumes the *month* comes first (unless you specify dayfirst=True). – FObersteiner Apr 28 '23 at 06:21

2 Answers


You use if df[col_name].str.isnumeric().all():

This checks whether every value in the column consists only of numeric characters, which returns False here because the column also contains dates with separators such as dashes and dots. The whole column is then passed to pd.to_datetime with errors='coerce', which turns the serial date into NaT (shown as "nan" in your output).
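To make the point concrete, here is a small sketch using three values taken from the question's input. The variable names `col` and `per_row` are only illustrative.

```python
import pandas as pd

# Three values from the question's input: two date-like strings and one serial date
col = pd.Series(["08-12-2016", "abc", "45678"])

# str.isnumeric() is evaluated per element...
per_row = col.str.isnumeric()
print(per_row.tolist())  # [False, False, True]

# ...but .all() collapses it into one answer for the whole column
print(per_row.all())     # False, so the serial-date branch is never taken
```

The per-element mask is the useful part: it already identifies which rows hold serial dates.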

I think what you wanted to do is to check row by row. You could do this over the whole df using df.iterrows(). Since you already rely so heavily on the default to_datetime() anyway, you could also call that first and then iterate only over the rows it fails to parse (those that come back as NaT with errors='coerce').

Either way, you should go through the other inputs/outputs as well, as there seem to be some more errors (e.g. 2008,13,02 becomes "2008/02/01").

Anne
  • Selecting the serial dates (purely numeric) is the **easy** part - if you have a look at the OP's "Input" example, the mix of formats in the datetime strings is what's causing nightmares... – FObersteiner Apr 26 '23 at 14:00
  • I agree, but OP asked specifically why the serial dates weren't being converted so that's what I'm answering. I'll edit to add this though. – Anne Apr 26 '23 at 14:22

https://gist.github.com/OmarArain/9959241

Try using a pandas Timedelta added to the Excel origin date:

import pandas as pd

date = 5679

def convert_excel_time(excel_time):
    '''
    Converts an Excel serial date (int or float) to a pandas datetime object.
    For fractional values, round the result to '1min' (e.g. with
    .dt.round('1min') on a Series) to correct floating point conversion inaccuracy.
    '''
    
    return pd.to_datetime('1899-12-30') + pd.to_timedelta(excel_time, 'D')

print(convert_excel_time(date))

output:

 1915-07-19 00:00:00
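
The same arithmetic also works on a whole column at once. The sketch below assumes the serial dates have already been separated from the other strings (the `serials` Series is hypothetical) and arrive as text, so they are converted with pd.to_numeric first.

```python
import pandas as pd

# Hypothetical column holding only serial dates, stored as strings
serials = pd.Series(["5679", "45678"])

# Convert to numbers, then add them as days to the Excel origin
dates = pd.Timestamp("1899-12-30") + pd.to_timedelta(pd.to_numeric(serials), unit="D")

print(dates.dt.strftime("%Y-%m-%d").tolist())
```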
Golden Lion