28

How can I calculate the age of a person (based off the dob column) and add a column to the dataframe with the new value?

dataframe looks like the following:

    lname      fname     dob
0    DOE       LAURIE    03011979
1    BOURNE    JASON     06111978
2    GRINCH    XMAS      12131988
3    DOE       JOHN      11121986

I tried doing the following:

now = datetime.now()
df1['age'] = now - df1['dob']

But, received the following error:

TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'str'

vino88
  • 163
  • 13
Dave
  • 6,968
  • 7
  • 26
  • 32

7 Answers7

50
import datetime as DT
import io
import numpy as np
import pandas as pd

pd.options.mode.chained_assignment = 'warn'

content = '''     ssno        lname         fname    pos_title             ser  gender  dob 
0    23456789    PLILEY     JODY        BUDG ANAL             0560  F      031871 
1    987654321   NOEL       HEATHER     PRTG SRVCS SPECLST    1654  F      120852
2    234567891   SONJU      LAURIE      SUPVY CONTR SPECLST   1102  F      010999
3    345678912   MANNING    CYNTHIA     SOC SCNTST            0101  F      081692
4    456789123   NAUERTZ    ELIZABETH   OFF AUTOMATION ASST   0326  F      031387'''

df = pd.read_csv(io.StringIO(content), sep='\s{2,}')
df['dob'] = df['dob'].apply('{:06}'.format)

now = pd.Timestamp('now')
df['dob'] = pd.to_datetime(df['dob'], format='%m%d%y')    # 1
df['dob'] = df['dob'].where(df['dob'] < now, df['dob'] -  np.timedelta64(100, 'Y'))   # 2
df['age'] = (now - df['dob']).astype('<m8[Y]')    # 3
print(df)

yields

        ssno    lname      fname            pos_title   ser gender  \
0   23456789   PLILEY       JODY            BUDG ANAL   560      F   
1  987654321     NOEL    HEATHER   PRTG SRVCS SPECLST  1654      F   
2  234567891    SONJU     LAURIE  SUPVY CONTR SPECLST  1102      F   
3  345678912  MANNING    CYNTHIA           SOC SCNTST   101      F   
4  456789123  NAUERTZ  ELIZABETH  OFF AUTOMATION ASST   326      F   

                  dob  age  
0 1971-03-18 00:00:00   43  
1 1952-12-08 18:00:00   61  
2 1999-01-09 00:00:00   15  
3 1992-08-16 00:00:00   22  
4 1987-03-13 00:00:00   27  

  1. It looks like your dob column are currently strings. First, convert them to Timestamps using pd.to_datetime.
  2. The format '%m%d%y' converts the last two digits to years, but unfortunately assumes 52 means 2052. Since that's probably not Heather Noel's birthyear, let's subtract 100 years from dob whenever the dob is greater than now. You may want to subtract a few years to now in the condition df['dob'] < now since it may be slightly more likely to have a 101 year old worker than a 1 year old worker...
  3. You can subtractdob from now to obtain timedelta64[ns]. To convert that to years, use astype('<m8[Y]') or astype('timedelta64[Y]').
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • I get the following error from the code above when running in a .py file. What does this mean SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_index,col_indexer] = value instead df1['dob'] = pd.to_datetime(df1['dob'], format='%m%d%y') c:\users\davidl~1\appdata\local\temp\1\tmpxt4mqz.py:37: Try using .loc[row_index,col_indexer] = value instead df1['dob'] = df1['dob'].where(df1['dob'] < now, df1['dob'] - np.timedelta64(100, 'Y')) – Dave Nov 08 '14 at 00:00
  • That warning is not an error, but it is a warning that the code *may* be assigning to a copy of the data rather than the original data in the DataFrame. See [this answer](http://stackoverflow.com/a/21463854/190597) and [the docs](http://pandas.pydata.org/pandas-docs/dev/indexing.html#why-does-the-assignment-when-using-chained-indexing-fail). I think it is a false alarm in the context of the code above, but I am uncertain why you are seeing the warning at all, since when I run the code above I see no warning. When you run the code above, do you see the warning? – unutbu Nov 08 '14 at 01:01
  • I'm sorry, I'm getting a ValueError with the code above. – Dave Nov 08 '14 at 17:12
  • in ipython the original code works great. When creating a function in a .py file it gives me the error. – Dave Nov 08 '14 at 17:13
  • Can you post a few lines of `eeprofil.txt` so the error you are seeing (is it a ValueError or the SettingWithCopyWarning?) is reproducible? – unutbu Nov 08 '14 at 18:04
  • 1
    When I run your code on the CSV data you posted, I get the correct result with no errors. – unutbu Nov 09 '14 at 18:21
  • I cleaned up my original code so that the date comes across as four digit years. Maybe this will help. – Dave Nov 10 '14 at 18:38
  • @coldspeed: Thanks for the suggested improvements, but I think there are two problems with using `pd.compat.StringIO`. (1) In Python2 `pd.compat.StringIO` points to `StringIO.StringIO`. But in Python2, `pd.read_table` expects `BytesIO`, not `StringIO`. (2) The [API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) warns that `pd.compat` is private, and its functionality is not guaranteed. I'd be willing to overlook (2), but (1) is a deal-breaker. – unutbu Apr 14 '19 at 01:42
  • @unutbu I was updating this to work with more recent versions of pandas. Read table seems to be deprecated in 0.24, so I changed the function to `read_csv`, which I don't think has a problem reading `StringIO` input... I could be mistaken. – cs95 Apr 14 '19 at 01:43
  • Yes, I agree at this point I'd rather support Python3. So I've changed the code to simply use `io.StringIO`. Changing `read_table` to `read_csv` is a good change too (thank you for that). – unutbu Apr 14 '19 at 01:46
  • 1
    I don't have a Python2 with pandas to test this out on anymore, but if I recall correctly, both `read_table` and `read_csv` need `BytesIO` and raise some error if passed `StringIO`. – unutbu Apr 14 '19 at 01:48
  • Point 3 is no longer supported in Pandas 2.0 [from what I read](https://pandas.pydata.org/docs/whatsnew/v2.0.0.html?highlight=sum#construction-with-datetime64-or-timedelta64-dtype-with-unsupported-resolution) - gives `ValueError: Cannot convert from timedelta64[ns] to timedelta64[Y]. Supported resolutions are 's', 'ms', 'us', 'ns'` – Rafs Jul 19 '23 at 08:58
19

I found easier solution:

import pandas as pd
from datetime import datetime
from datetime import date

d = {'col0': [1, 2, 6], 
     'col1': [3, 8, 3], 
     'col2': ['17.02.1979', '11.11.1993', '01.08.1961']}

df = pd.DataFrame(data=d)

def calculate_age(born):
    born = datetime.strptime(born, "%d.%m.%Y").date()
    today = date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

df['age'] = df['col6'].apply(calculate_age)
print(df)

output:

     col0  col1  col3        age
0       1     3  17.02.1979   39
1       2     8  11.11.1993   24
2       6     3  01.08.1961   57
Tiago Martins Peres
  • 14,289
  • 18
  • 86
  • 145
nnaqa
  • 259
  • 2
  • 4
7
# Data setup
df

    lname   fname        dob
0     DOE  LAURIE 1979-03-01
1  BOURNE   JASON 1978-06-11
2  GRINCH    XMAS 1988-12-13
3     DOE    JOHN 1986-11-12

# Make sure to parse all datetime columns in advance
df['dob'] = pd.to_datetime(df['dob'], errors='coerce')

If you want only the year portion of the age, use @unutbu's solution. . .

now = pd.to_datetime('now')
now
# Timestamp('2019-04-14 00:00:43.105892')

(now - df['dob']).astype('<m8[Y]') 

0    40.0
1    40.0
2    30.0
3    32.0
Name: dob, dtype: float64

Another option is to subtract the year portion and account for the month difference using

(now.year - df['dob'].dt.year) - ((now.month - df['dob'].dt.month) < 0)

0    40
1    40
2    30
3    32
Name: dob, dtype: int64

If you want the (almost) precise age (including the fractional portion), query total_seconds and divide.

(now - df['dob']).dt.total_seconds() / (60*60*24*365.25)

0    40.120446
1    40.840501
2    30.332630
3    32.418872
Name: dob, dtype: float64
cs95
  • 379,657
  • 97
  • 704
  • 746
3

Use this one liner when you are trying to find the age from date of birth column with current year

import pandas as pd

df["dob"] = pd.to_datetime(data["dob"])

df["age"] = df["dob"].apply(lambda x : (pd.datetime.now().year - x.year))
Tanmay
  • 111
  • 1
  • 6
2

First thought is that your years are two digit, which is a not great choice in this day and age. In any case, I'm going to assume that all years like 05 are actually 1905. This is probably not correct(!) but coming up with the right rule is going to depend a lot on your data.

from datetime import date

def age(date1, date2):
    naive_yrs = date2.year - date1.year
    if date1.replace(year=date2.year) > date2:
        correction = -1
    else:
        correction = 0
    return naive_yrs + correction

df1['age'] = df1['dob'].map(lambda x: age(date(int('19' + x[-2:]), int(x[:2]), int(x[2:-2])), date.today()))
Brandon Humpert
  • 322
  • 2
  • 11
1

What about the following solution:

import datetime as dt
import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta

df1['age'] = [relativedelta(pd.to_datetime('now'), d).years for d in df1['dob']]
user4424563
  • 115
  • 2
0

once you have the year, month, and day part of DOB separated, you can use the below-given lines to get age in no. of years and months.

tmpdf = df[['born_year','born_month','born_day']].copy()
tmpdf.columns = ["year", "month", "day"]
df['dob']=pd.to_datetime(tmpdf , errors='coerce')
df['age_y']=(datetime.today()-df['dob']).dt.days/365.25
df['age_y']=df['age_y'].astype(int)
df['age_m']=((datetime.today()-df['dob']).dt.days/365.25 -  df['age_y'] ) * 12
df['age_m']=df['age_m'].astype(int)
Willie Cheng
  • 7,679
  • 13
  • 55
  • 68