I need to calculate the elapsed time between events. My task is similar to this one, but I get an error when I try to reproduce it:

print(df1.sort_values(['ip','timestamp']).head(20))
df1['diff'] = df1.sort_values(['ip','timestamp']).groupby('ip')['timestamp'].diff()

                 ip           timestamp
26422    1.0.150.87 2021-08-21 03:17:00
26192    1.0.150.87 2021-08-21 03:17:00
77885   1.0.155.191 2021-08-22 05:54:00
77387   1.0.155.191 2021-08-22 05:54:00
27240    1.0.227.92 2021-08-21 03:47:00
27009    1.0.227.92 2021-08-21 03:47:00
47641  1.10.130.122 2021-08-21 13:44:00
47279  1.10.130.122 2021-08-21 13:44:00
11912   1.10.202.23 2021-08-20 16:59:00
11825   1.10.202.23 2021-08-20 16:59:00
92     1.10.213.176 2021-08-20 12:02:00
96     1.10.213.176 2021-08-20 12:02:00
2580   1.10.213.176 2021-08-20 13:09:00
2572   1.10.213.176 2021-08-20 13:09:00
4518   1.10.213.176 2021-08-20 13:57:00
4491   1.10.213.176 2021-08-20 13:57:00
8057   1.10.214.251 2021-08-20 15:23:00
8017   1.10.214.251 2021-08-20 15:23:00
35302   1.10.219.41 2021-08-21 08:09:00
35030   1.10.219.41 2021-08-21 08:09:00
Traceback (most recent call last):
  File "./analyser.py", line 59, in <module>
    df1['diff'] = df1.sort_values(['ip','timestamp']).groupby('ip')['timestamp'].diff()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 3607, in __setitem__
    self._set_item(key, value)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 3779, in _set_item
    value = self._sanitize_column(value)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 4501, in _sanitize_column
    return _reindex_for_setitem(value, self.index)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 10777, in _reindex_for_setitem
    raise err
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 10772, in _reindex_for_setitem
    reindexed_value = value.reindex(index)._values
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/series.py", line 4579, in reindex
    return super().reindex(index=index, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/generic.py", line 4809, in reindex
    return self._reindex_axes(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/generic.py", line 4830, in _reindex_axes
    obj = obj._reindex_with_indexers(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/generic.py", line 4874, in _reindex_with_indexers
    new_data = new_data.reindex_indexer(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 666, in reindex_indexer
    self.axes[axis]._validate_can_reindex(indexer)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3785, in _validate_can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis

I can't figure out why it is not working. Also, I wonder whether there is a better way to solve this, for example using 'native' Python functionality. Thank you for your help!

Agenobarb

1 Answer

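The error means that the index of df1 contains duplicate labels, so pandas cannot align the result of diff() back to the original rows when assigning the new column. Sort with DataFrame.sort_values using ignore_index=True and assign the result back to df1 first, so that the index becomes a unique 0..n-1 range: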

df1 = df1.sort_values(['ip','timestamp'], ignore_index=True)
df1['diff'] = df1.groupby('ip')['timestamp'].diff()
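
To make this concrete, here is a minimal, self-contained sketch. The sample rows and the duplicated index labels are made up for illustration (the real df1 is not shown in full), and the name `broken` is only a scratch copy used to show the failing assignment.

```python
import pandas as pd

# Hypothetical sample data. The duplicated index labels (0, 0, 1, 1) are an
# assumption that mimics the condition behind the ValueError in the question.
df1 = pd.DataFrame(
    {
        "ip": ["1.10.213.176", "1.0.150.87", "1.10.213.176", "1.0.150.87"],
        "timestamp": pd.to_datetime(
            [
                "2021-08-20 13:09:00",
                "2021-08-21 03:17:00",
                "2021-08-20 12:02:00",
                "2021-08-21 03:20:00",
            ]
        ),
    },
    index=[0, 0, 1, 1],
)

# A non-unique index is what makes the column assignment ambiguous.
print(df1.index.is_unique)

# The original approach, tried on a scratch copy so df1 stays untouched.
try:
    broken = df1.copy()
    broken["diff"] = (
        broken.sort_values(["ip", "timestamp"]).groupby("ip")["timestamp"].diff()
    )
except ValueError as err:
    print("assignment failed:", err)

# The fix: sort, reset the index to 0..n-1, then compute the per-ip difference.
df1 = df1.sort_values(["ip", "timestamp"], ignore_index=True)
df1["diff"] = df1.groupby("ip")["timestamp"].diff()
print(df1)
```

The first row of each ip group has no previous event, so its diff is NaT. If the original row order matters, calling df1.reset_index(drop=True) before the original one-liner should be another way to get a unique index without reordering the frame.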
jezrael