
Assume df is defined as:

import numpy as np
import pandas as pd

data = {'duration': ['1week 3day 2hour 4min 23', '2hour 4min 23sec', '2hour 4min', np.nan, '', '23sec']}
df = pd.DataFrame(data)

I'm trying to calculate each duration as a total number of seconds. I replaced the unit names like this:

df['duration'] = df['duration'].str.replace('week', '*604800+') \
                               .str.replace('day', '*86400+') \
                               .str.replace('hour', '*3600+') \
                               .str.replace('min', '*60+') \
                               .str.replace('sec', '') \
                               .str.replace(' ', '')

But I can't run any of the eval functions (pd.eval, apply with eval, the built-in eval, etc.): some cells end with a '+' sign, and the empty string and NaN values cause further problems. Any help?

PS: This is not a duplicate question.

maxime
  • "other string/na problems": Can you be more specific? – Scott Hunter May 16 '22 at 12:46
  • I'd write this as a function to work on an individual string (or even just a part of the string, such as `1week`) then use `apply` on your dataframe. You probably don't want to use `eval` here – Steve May 16 '22 at 12:47
  • Empty strings cause an exception; the eval functions don't accept them. I'm not sure, but I think NA values cause exceptions too; the error message (a dtype error) was not clear enough to be sure. – maxime May 16 '22 at 12:49
  • The entry `2hour 4min` will give you a trailing `+` which you could check for, but I think you're taking the wrong approach altogether. – Steve May 16 '22 at 12:49
  • I'm open for any solution/approach – maxime May 16 '22 at 12:50
  • To justify "probably don't want to use `eval`": [Why is using 'eval' a bad practice?](https://stackoverflow.com/q/1832940/5358968) but pandas eval might be alright. – Steve May 16 '22 at 14:03
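
For reference, here is a minimal sketch (not taken from either answer below) of how the replace-and-eval approach from the question could be patched up, assuming the trailing '+' left by entries like '2hour 4min' is stripped and the empty/NaN cells are skipped before evaluation; pd.eval is applied per cell, as hinted in the comments, and the 'seconds' column name is just illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({'duration': ['1week 3day 2hour 4min 23', '2hour 4min 23sec',
                                '2hour 4min', np.nan, '', '23sec']})

# turn each duration into an arithmetic expression, e.g. '2hour 4min' -> '2*3600+4*60'
expr = (df['duration']
        .str.replace('week', '*604800+', regex=False)
        .str.replace('day', '*86400+', regex=False)
        .str.replace('hour', '*3600+', regex=False)
        .str.replace('min', '*60+', regex=False)
        .str.replace('sec', '', regex=False)
        .str.replace(' ', '', regex=False)
        .str.rstrip('+'))  # drop the trailing '+' from entries with no seconds part

# evaluate only non-empty strings; NaN and '' stay missing after reindex
df['seconds'] = expr[expr.str.len() > 0].map(pd.eval).reindex(df.index)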

2 Answers


You can use a regex combined with a custom function to replace weeks with 7 days and to append seconds to bare numbers (you can add other units). Then convert with to_timedelta:

def change_units(m):
    # map each matched unit to a (multiplier, to_timedelta-compatible unit) pair;
    # bare numbers (empty unit) are treated as seconds
    d = {'week': (7, 'days'), '': (1, 's')}
    _, i, period = m.groups()
    factor, txt = d[period]
    return f'{factor*int(i)}{txt}'

df['delta'] = pd.to_timedelta(df['duration'].str.replace(r'((\d+)\s*(week|)\b)',
                                                         change_units, regex=True))

output:

                   duration            delta
0  1week 3day 2hour 4min 23 10 days 02:04:23
1          2hour 4min 23sec  0 days 02:04:23
2                2hour 4min  0 days 02:04:00
3                       NaN              NaT
4                                        NaT
5                     23sec  0 days 00:00:23

Then you can benefit from the Timedelta objects, for example by converting to total seconds with total_seconds:

pd.to_timedelta(df['duration'].str.replace(r'((\d+)\s*(week|)\b)',
                                           change_units, regex=True)
               ).dt.total_seconds()

output:

0    871463.0
1      7463.0
2      7440.0
3         NaN
4         NaN
5        23.0
Name: duration, dtype: float64
mozway
  • @maxime have you checked the second part of the answer? – mozway May 16 '22 at 12:56
  • Thanks. It works on the sample df but not on the real one; the exception is: `symbols w/o a number. During handling of the above exception, another exception occurred: ValueError: symbols w/o a number`. BTW, I think the code skips rows beginning with hours (row 3 becomes NA). – maxime May 16 '22 at 13:13
  • This should be quite easy to fix; can you provide an example of such rows? – mozway May 16 '22 at 13:14
  • Regarding row 3, it was a NaN in your example; I updated the answer to show the before/after conversion side by side, so there is no issue there. – mozway May 16 '22 at 13:17
  • Sorry, you are right, I forgot the blank string value. BTW, the exception message does not show the row that caused the error. – maxime May 16 '22 at 13:23
  • you can add `errors='coerce'` as parameter to `to_timedelta` to transform the invalid rows to NaT – mozway May 16 '22 at 13:25
  • I was just trying it right now. it worked, thanks for your help – maxime May 16 '22 at 13:29
  • let me know if you identify the invalid rows and need to handle them differently – mozway May 16 '22 at 13:32
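
For reference, a minimal sketch of the errors='coerce' suggestion from the comments, reusing the change_units helper from the answer above (the 'delta' column matches the answer; the 'seconds' column name is just illustrative):

# rows that to_timedelta cannot parse become NaT instead of raising
df['delta'] = pd.to_timedelta(
    df['duration'].str.replace(r'((\d+)\s*(week|)\b)', change_units, regex=True),
    errors='coerce'
)
df['seconds'] = df['delta'].dt.total_seconds()  # NaT becomes NaN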

I had a different approach to the excellent accepted answer:

I wrote a function to convert the string into seconds:

def convert_all(s):
    if not isinstance(s, str):
        # E.g. for np.nan
        return s
    return sum(convert_part(part) for part in s.split())

def convert_part(part):
    """Convert an individual segment into seconds.

    >>> convert_part('1day')
    86400.0
    """
    if part.isnumeric():
        return float(part)
    in_seconds = {'week': 7*24*60*60, 'day': 24*60*60, 'hour': 60*60, 'min': 60, 'sec': 1}
    for k,v in in_seconds.items():
        if part.endswith(k):
            return float(part.strip(k))*v
    else:
        # Handle error here - just printing for now
        print(part)
        return 0.0

Then you can use Series.apply:

df['duration_sec'] = df['duration'].apply(convert_all)

Output:

                   duration  duration_sec
0  1week 3day 2hour 4min 23      871463.0
1          2hour 4min 23sec        7463.0
2                2hour 4min        7440.0
3                       NaN           NaN
4                                     0.0
5                     23sec          23.0
Steve