1

I have a dataframe with different timestamp for each user, and I want to calculate the duration. I used this code to import my CSV files:

import pandas as pd
import glob

path = r'C:\Users\...\Desktop' 
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0,encoding='ISO-8859-1')
    li.append(df)

df = pd.concat(li, axis=0, ignore_index=True)

df.head()

  ID     timestamp
1828765  31-05-2021 22:27:03    
1828765  31-05-2021 22:27:12    
1828765  31-05-2021 22:27:13    
1828765  31-05-2021 22:27:34
2056557  21-07-2021 10:27:12
2056557  21-07-2021 10:27:20
2056557  21-07-2021 10:27:22

And I want to get something like that

   ID    timestamp             duration(s)
1828765  31-05-2021 22:27:03    NAN
1828765  31-05-2021 22:27:12    9
1828765  31-05-2021 22:27:13    1
1828765  31-05-2021 22:27:34    21
2056557  21-07-2021 10:27:12    NAN
2056557  21-07-2021 10:27:20    8
2056557  21-07-2021 10:27:22    2

I've used this code, but doesn't work for me

import datetime
df['timestamp'] =  pd.to_datetime(df['timestamp'], format = "%d-%m-%Y %H:%M:%S") 
df['time_diff'] = 0
for i in range(df.shape[0] - 1):
    df['time_diff'][i+1] = (datetime.datetime.min +  (df['timestamp'][i+1] - df['timestamp'][i])).time()
Dharman
  • 30,962
  • 25
  • 85
  • 135
Hermoine
  • 63
  • 7
  • 1
    Does this answer your question? [Python finding difference between two time stamps in minutes](https://stackoverflow.com/questions/36481189/python-finding-difference-between-two-time-stamps-in-minutes) – adist98 Oct 24 '21 at 16:22
  • Please trim your code to make it easier to find your problem. Follow these guidelines to create a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – Community Oct 24 '21 at 17:25
  • @HenryEcker I just read a csv file in jupyter and what you can see in the description is a few lines from the dataframe.. – Hermoine Oct 24 '21 at 17:46
  • No worries, just keep in mind for the future; these tags would be appropriate if your issue was *related* to Jupyter (notebook). – desertnaut Oct 24 '21 at 17:55
  • @Hermoine this is a much improved question. I've retracted my downvote and also voted to reopen this question. I appreciate your willingness to provide more information and get it into the type of format we're looking for here. – Henry Ecker Oct 24 '21 at 18:38
  • I've used df['time_diff'][i+1] = df['timestamp'][i+1] - df['timestamp'][i] and it works :D Problem Resolved – Hermoine Oct 24 '21 at 18:39
  • The question will be eventually reopened, in which case it is highly advisable that you post this as an answer and accept it (after 48 hours). – desertnaut Oct 24 '21 at 19:58

1 Answers1

0

Operations which occur over groups of values are GroupBy operations in pandas.

pandas supports mathematical operations over timestamps natively. For this reason, subtraction will give the correct duration between any two timestamps.

We've already successfully converted out timestamp column to datetime64[ns]

df['timestamp'] = pd.to_datetime(df['timestamp'], format="%d-%m-%Y %H:%M:%S")

Now we can take the difference between rows within groups with Groupby.diff

df['duration'] = df.groupby('ID')['timestamp'].diff()

df

        ID           timestamp        duration
0  1828765 2021-05-31 22:27:03             NaT
1  1828765 2021-05-31 22:27:12 0 days 00:00:09
2  1828765 2021-05-31 22:27:13 0 days 00:00:01
3  1828765 2021-05-31 22:27:34 0 days 00:00:21
4  2056557 2021-07-21 10:27:12             NaT
5  2056557 2021-07-21 10:27:20 0 days 00:00:08
6  2056557 2021-07-21 10:27:22 0 days 00:00:02

If we want to get the duration in seconds we can extract the total number of seconds using Series.dt.total_seconds:

df['duration (s)'] = df.groupby('ID')['timestamp'].diff().dt.total_seconds()

df:

        ID           timestamp  duration (s)
0  1828765 2021-05-31 22:27:03           NaN
1  1828765 2021-05-31 22:27:12           9.0
2  1828765 2021-05-31 22:27:13           1.0
3  1828765 2021-05-31 22:27:34          21.0
4  2056557 2021-07-21 10:27:12           NaN
5  2056557 2021-07-21 10:27:20           8.0
6  2056557 2021-07-21 10:27:22           2.0

Complete Working Example:

import pandas as pd

df = pd.DataFrame({
    'ID': [1828765, 1828765, 1828765, 1828765, 2056557, 2056557, 2056557],
    'timestamp': ['31-05-2021 22:27:03', '31-05-2021 22:27:12',
                  '31-05-2021 22:27:13', '31-05-2021 22:27:34',
                  '21-07-2021 10:27:12', '21-07-2021 10:27:20',
                  '21-07-2021 10:27:22']
})

df['timestamp'] = pd.to_datetime(df['timestamp'], format="%d-%m-%Y %H:%M:%S")
df['duration (s)'] = df.groupby('ID')['timestamp'].diff().dt.total_seconds()
print(df)
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57