0

I have a dataset with wrong times (24:00:00 to 26:18:00) I wanted to know what is the best approach to deal with this kind of data in python.

I tried to convert the column from object to datetime using this code:

stopTimeArrDep['departure_time'] =  pd.to_datetime(stopTimeArrDep['departure_time']\
                                                   ,format='%H:%M:%S')

But I get this error:

ValueError: time data '24:04:00' does not match format '%H:%M:%S' (match)

So I tried adding errors='coerce' to avoid this error. But I end up with empty columns and unwanted date added to every row.

stopTimeArrDep['departure_time'] =  pd.to_datetime(stopTimeArrDep['departure_time']\
                                                   ,format='%H:%M:%S',errors='coerce')

output sample:

original_col    converted_col
23:45:00        1/1/00 23:45:00
23:51:00        1/1/00 23:51:00
24:04:00
23:42:00        1/1/00 23:42:00
26:01:00

Any suggestion on what is the best approach to handle this issue. Thank you,

leena
  • 563
  • 1
  • 8
  • 25
  • The reason you are getting error is because the `H` value you have is more than 23:59:59 (a day has only 24 hours). Those values cannot be 'coerced' either. If you a have date along-with time, you can bump up the date to next day & hours could start from 0 after 24 has been reached. – moys Oct 24 '19 at 04:17
  • Thanks for your reply. I only have times. – leena Oct 24 '19 at 04:43

1 Answers1

0

Solution

You could treat the original_col as some elapsed time interval and not time, if that makes any sense. You could use datetime.timedelta and then add this datetime.timedelta to a datetime.datetime to get some datetime object; which you could finally use to get the date and time separately.

Example

from datetime import datetime, timedelta

time_string = "20:30:20"

t = datetime.utcnow()
print('t: {}'.format(t))
HH, MM, SS = [int(x) for x in time_string.split(':')]
dt = timedelta(hours=HH, minutes=MM, seconds=SS)
print('dt: {}'.format(dt))
t2 = t + dt
print('t2: {}'.format(t2))
print('t2.date: {} | t2.time: {}'.format(str(t2.date()), str(t2.time()).split('.')[0]))

Output:

t: 2019-10-24 04:43:08.255027
dt: 20:30:20
t2: 2019-10-25 01:13:28.255027
t2.date: 2019-10-25 | t2.time: 01:13:28

For Your Usecase

# Define Custom Function
def process_row(time_string):
    HH, MM, SS = [int(x) for x in time_string.split(':')]
    dt = timedelta(hours=HH, minutes=MM, seconds=SS)
    return dt

# Make Dummy Data
original_col = ["23:45:00", "23:51:00", "24:04:00", "23:42:00", "26:01:00"]
df = pd.DataFrame({'original_col': original_col, 'dt': None})

# Process Dataframe
df['dt'] = df.apply(lambda x: process_row(x['original_col']), axis=1)
df['t'] = datetime.utcnow()
df['t2'] = df['dt'] + df['t']
# extracting date from timestamp
df['Date'] = [datetime.date(d) for d in df['t2']] 
# extracting time from timestamp
df['Time'] = [datetime.time(d) for d in df['t2']] 
df

Output:
enter image description here

Using pandas.to_datetime():

pd.to_datetime(df['t2'], format='%H:%M:%S',errors='coerce')

Output:

0   2019-10-25 09:38:39.349410
1   2019-10-25 09:44:39.349410
2   2019-10-25 09:57:39.349410
3   2019-10-25 09:35:39.349410
4   2019-10-25 11:54:39.349410
Name: t2, dtype: datetime64[ns]

References

  1. How to construct a timedelta object from a simple string
CypherX
  • 7,019
  • 3
  • 25
  • 37
  • Thank you for your reply. I am not sure how to reflect this to my dataframe column. Could you clarify it to me. Thanks – leena Oct 24 '19 at 04:43
  • Well, if you consider a single row in `original_col`, what I mentioned above, could be applied to that. So, you could make a custom function where you just pass in the `time_string` and some datetime (as reference: this will replace `t`) and out comes t2. This could by applied to each row and that could be finally stored in a new column. You would need to use `pandas.Series.apply` for this purpose. – CypherX Oct 24 '19 at 04:51
  • I will add some more stuff for more clarity. – CypherX Oct 24 '19 at 05:16
  • @leena See the section: **For Your Usecase** for an example. – CypherX Oct 24 '19 at 05:19
  • Thank you CypherX. But time column is still an `object` not `datetime`. – leena Oct 24 '19 at 05:49
  • And returns few errors when I replace `df` with my dataframe name. And `original_column` with the column name. Such as: `KeyError: ('departure_time', 'occurred at index 0')` – leena Oct 24 '19 at 05:52
  • @leena The solution that I provided with the data, please confirm if that is reproducible or not. Since I do not have access to your actual data, I would not be able to diagnose/fix the errors you are getting. Here `t2` is the `datetime`. You will have to use the ideas shared here to adapt to your specific usecase. – CypherX Oct 24 '19 at 09:47
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/201374/discussion-between-cypherx-and-leena). – CypherX Oct 24 '19 at 09:58