1

I need to calculate the mean() of a time column, but the column's type is string. How can I do it?

id   time
1   1h:2m
2   1h:58m
3    35m
4    2h
...
Joe

3 Answers

2

You can use a regex to extract the hours and minutes. To calculate the mean time in minutes:

# Extract the hour and minute parts; rows without a match become 0
h = df['time'].str.extract(r'(\d{1,2})h').fillna(0).astype(int)
m = df['time'].str.extract(r'(\d{1,2})m').fillna(0).astype(int)

# Mean duration in minutes
(h * 60 + m).mean()

Result:

0    83.75
dtype: float64
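
If you want that mean back in the question's h/m notation, here is a minimal sketch using only the standard library. The helper name `format_minutes` is hypothetical (it is not part of pandas), and it assumes the mean is a non-negative number of minutes:

```python
def format_minutes(total_minutes: float) -> str:
    """Format a duration given in minutes, e.g. 83.75 -> '1h:23m:45s'."""
    # Work in whole seconds to avoid floating-point leftovers
    total_seconds = round(total_minutes * 60)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h:{minutes}m:{seconds}s"


print(format_minutes(83.75))  # 1h:23m:45s
```

Because `.mean()` above returns a one-element Series, you would pass its first value, for example `format_minutes((h * 60 + m).mean().iloc[0])`.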
Mykola Zotko
1

This is largely inspired by "How to construct a timedelta object from a simple string", but you can do the following:

import re
from datetime import timedelta

import pandas as pd

# Compile once; every part (hours, minutes, seconds) is optional
regex = re.compile(r'((?P<hours>\d+?)h)?:*((?P<minutes>\d+?)m)?:*((?P<seconds>\d+?)s)?')

def convertToSecond(time_str):
    parts = regex.match(time_str)
    if not parts:
        return None

    # Keep only the parts that are present and pass them to timedelta
    time_params = {name: int(param) for name, param in parts.groupdict().items() if param}
    return timedelta(**time_params).total_seconds()

df = pd.DataFrame({'time': ['1h:2m', '1h:58m', '35m', '2h']})

df['inSecond'] = df['time'].apply(convertToSecond)
mean_inSecond = df['inSecond'].mean()
print(f"Mean of Time Column: {timedelta(seconds=mean_inSecond)}")

Result:

Mean of Time Column: 1:23:45
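
One caveat: every part of that pattern is optional, so `regex.match` also succeeds on a string like 'abc' (it matches the empty string at the start), and such rows silently count as 0 seconds. Here is a minimal sketch of a stricter variant that rejects them. The names `DURATION_RE` and `parse_duration` are hypothetical, and it assumes the only valid parts are `h`, `m` and `s`, in that order:

```python
import re
from datetime import timedelta
from statistics import mean

# Hypothetical stricter pattern: at most one ':' between parts
DURATION_RE = re.compile(
    r'(?:(?P<hours>\d+)h)?:?(?:(?P<minutes>\d+)m)?:?(?:(?P<seconds>\d+)s)?'
)


def parse_duration(text: str) -> timedelta:
    """Parse strings like '1h:2m', '35m' or '2h'; raise on anything else."""
    match = DURATION_RE.fullmatch(text.strip())
    # fullmatch must consume the whole string, and at least one part must be present
    if match is None or not any(match.groupdict().values()):
        raise ValueError(f"unrecognised duration: {text!r}")
    parts = {name: int(value) for name, value in match.groupdict().items() if value}
    return timedelta(**parts)


values = ['1h:2m', '1h:58m', '35m', '2h']
mean_seconds = mean(parse_duration(v).total_seconds() for v in values)
print(timedelta(seconds=mean_seconds))  # 1:23:45
```

With pandas you could apply it the same way as above, e.g. `df['time'].apply(parse_duration)`, and malformed rows would then raise instead of skewing the mean.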
Renaud
1

Another possibility is to convert your string column into timedelta, since the values seem to be durations rather than times of day.

Since your strings are not all formatted the same way, you unfortunately cannot pass them directly to pandas' to_timedelta function. However, parser from dateutil has a fuzzy option that you can use to convert your column to datetime. If you subtract midnight today from that, you get the value as a timedelta.

import pandas as pd
from dateutil import parser
from datetime import date, datetime

df = pd.DataFrame([[1, '1h:2m'], [2, '1h:58m'], [3, '35m'], [4, '2h']], columns=['id', 'time'])

# Midnight today: the default date that parser.parse() fills in for missing fields
midnight = datetime.combine(date.today(), datetime.min.time())

# Parse each string as a time of day, then subtract midnight to get a timedelta
df['time'] = df['time'].apply(lambda x: parser.parse(x, fuzzy=True) - midnight)

This will convert your dataframe like this (print(df)):

   id     time
0   1 01:02:00
1   2 01:58:00
2   3 00:35:00
3   4 02:00:00

from which you can calculate the mean using print(df['time'].mean()):

0 days 01:23:45

Full example: https://ideone.com/Aze9mR

buddemat