How to compute the means of data separated by dashes with pandas

Question

I am trying to clean csv files so I retrieved all values which contained dashes in my ages column and I have this output

504       40-49
756       20-29
758       40-89

I would like to have the age mean instead of recording the age range as start_age-end_age.
I tried to compute the means but I cannot convert it to numeric, due to the dashes. I am expected this output :

504 45
756 25

Trenton McKinney · Answer 1 · 2020-05-19T19:55:53.987

To create the desired output, add age_mean and drop age_range.
pandas.Series.str.split
pandas.DataFrame.astype
pandas.DataFrame.mean
pandas.DataFrame.drop
Creating the age_mean can be accomplished with one line of code and without creating intermediary columns.
As per Python 3 rounding behavior, pandas.DataFrame.round doesn't round as one would expect, so I'm leaving the final answer as a decimal.
- .round() can be added as .mean(axis=1).round() if you want.

import pandas as pd

# crate dataframe
df = pd.DataFrame({'id': [504, 756, 758], 'age_range': ['40-49', '20-29', '40-89']})

|    |   id | age_range   |
|---:|-----:|:------------|
|  0 |  504 | 40-49       |
|  1 |  756 | 20-29       |
|  2 |  758 | 40-89       |

# create age_mean
df['age_mean'] = df['age_range'].str.split('-', expand=True).astype('int').mean(axis=1)

# drop age_range
df.drop(columns=['age_range'], inplace=True)

# final dataframe
|    |   id |   age_mean |
|---:|-----:|-----------:|
|  0 |  504 |       44.5 |
|  1 |  756 |       24.5 |
|  2 |  758 |       64.5 |

Mayank Porwal · Answer 2 · 2020-05-19T19:45:25.043

Considering below to be your dataframe:

In [966]: df
Out[966]: 
     a    age
0  504  40-49
1  756  20-29
2  758  40-89

You can do this:

## split the `age` column into 2 columns
In [971]: df[['start_age','end_age']] = df['age'].str.split('-', expand=True) 

## convert these new columns to `int` 
In [979]: df.start_age = df.start_age.astype(int)    
In [980]: df.end_age = df.end_age.astype(int)

## Take mean of the row and round it.
In [983]: df['mean'] = df[['start_age', 'end_age']].mean(axis=1).round()

If you want only 2 columns, you can do this:

In [986]: df = df[['a', 'mean']] 
In [987]: df
Out[987]: 
Out[1010]: 
     a  mean
0  504  44.5
1  756  24.5
2  758  64.5

How to compute the means of data separated by dashes with pandas

2 Answers2