2

I have a database with a column named ['birth_date'], already converted string -> date using:

dataCopy.loc[:,'birth_date'] = dataCopy['birth_date'].astype('datetime64[ns]')

I also converted other columns my db has. So, as some of you know there is an issue with 2 digits year dates (mm/dd/yy or whatever) that when python sees a date like mm/dd/69 and below it assumes the year is 2069 and not 1969. Problem is I need to subtract this column with another column to pick the age my customer had when he canceled the service. Example: He was born in 1969 and canceled the service in 2019, so he was 53 years old. I already know how I can do it:

dataCopy['idade'] = (dataCopy['deleted_at'].dt.year - dataCopy['birth_date'].dt.year)

But first I need to fix the wrong years. Using format (y%m%d and variations) doesn't work. I mean, they work but they don't fix the wrong years. I am a beginner, already tried functions I saw here on Stack but I couldn't modify it to match my problem (plus I didn't understand it 100%). I appreciate any help.

2 Answers2

1

You need to create a custom function and map it to the birth_date column.

You can decide a cutoff year (say 40) above which will be classified to 19th century and below which will be classified to 20th century. For example, 62 will be converted to year 1962, 32 will be converted to 2032.

The below code create the custom function that converts the datestring.

import pandas as pd
import datetime as dt


def custom_date_function(date_string: str) -> dt.date:
    """
    Convert date string to date object
    """
    # Note that the first 8 character is the date without the time
    # Selecting the first 8 character
    # And then splitting the string using '/' to year, month and date
    date_components = date_string[0:8].split('/')

    # All number greater than 40 will be changed to 19th century
    # Else to 20th century
    # You may change the cutoff from 40

    if int(date_components[2]) >= 40:
        year = 1900 + int(date_components[2])
    else:
        year = 2000 + int(date_components[2])

    return dt.date(year=year, month=int(date_components[0]), day=int(date_components[1]))

Once the custom function is created, you may use it on your birth_date column.

# Example Code of applying the custom function on birth_date DataFrame column

# Creating an example DataFrame with birth_date column
df_dict = {'birth_date': ['11/22/67', '03/23/69', '11/22/27']}
dataCopy = pd.DataFrame(df_dict)

# Applying the function on birth_date DataFrame column
out = dataCopy['birth_date'].apply(custom_date_function)
print(out)

There is a possibility that birth_date column is already a date object. In that case you need to convert it to string before applying the custom_date_function.

  • I am sorry I forgot to mention all [birth_date] rows are using hours too "07/10/74 12:00 AM". Also I didn't understand the dictionary part, is it part of the code or just exemplification? And inside the function I should convert datetime to date object (undo what I did)? Why? – Caroline Griffin Feb 18 '22 at 15:11
  • I used dictionary in a way to create the pandas DataFrame. It is just an example. In your case, you can skip that part. Date objects are useful because, you can do simple arithmetic operations, like finding difference in year, days, etc between two dates. I have edited the answer to make it work with your date format. – Fahid Latheef A Feb 19 '22 at 06:21
1

slightly simpler than @Fahids solution but using the same conditional approach

import pandas as pd

# produce sample data
df = pd.DataFrame({'date': pd.to_datetime(['01-01-34', '01-01-66', '01-01-19', '01-01-20'], format='%m-%d-%y'), 'value': [1, 2, 3, 4]})
print(df)

output

        date  value
0 2034-01-01      1
1 2066-01-01      2
2 2019-01-01      3
3 2020-01-01      4


# define latest possible date the data can be from
cutoff_date = pd.to_datetime('01-01-2020')

# substract 100 years from date > cutoff date
df.loc[df.date > cutoff_date, 'date'] -= pd.DateOffset(years=100)

print(df)

output

        date  value
0 1934-01-01      1
1 1966-01-01      2
2 2019-01-01      3
3 2020-01-01      4
AlexWach
  • 592
  • 4
  • 16