3

I have a large dataset consisting of records of objects travelling within a network. The dataset is stored in a pandas DataFrame, which roughly looks like this:

Obj ID | Timestamp | x | y | link ID
-------|-----------|---|---|---------
 123   |  506      |123|456|  ab12
 123   |  517      |129|436|  ab12
 123   |  519      |125|454|  cd34
 126   |  501      |123|426|  gk77
 126   |  505      |153|453|  gk77
 126   |  507      |129|856|  bj88
 126   |  508      |143|496|  bj88
 126   |  512      |125|427|  gk77
 126   |  515      |153|453|  gk77
 126   |  518      |127|256|  ab12

The DataFrame is sorted by Obj ID, and each block of records belonging to one object is sorted by time (the Timestamp field, in seconds). Two consecutive rows with the same Obj ID and link ID represent the start and end of that object's visit to that link. Each object eventually stops on a link, indicated by a single record with that link ID at the end of the object's block.

Here's the desired output: condense the start record and the end record of one object's visit to one link into a single record. For the final link of each object, fill in StartTime for EndTime.

Obj ID | StartTime | EndTime | link ID
-------|-----------|---------|---------
 123   |  506      |   517   |  ab12
 123   |  519      |   519   |  cd34
 126   |  501      |   505   |  gk77
 126   |  507      |   508   |  bj88
 126   |  512      |   515   |  gk77
 126   |  518      |   518   |  ab12

Note:

  • I cannot guarantee that an object never travels the same link twice, but each visit to a link needs to be a separate record.

  • A naive loop solution that I implemented ran out of memory, because the dataset is quite large.

EDIT: I edited the dataset to include the situation mentioned in the first point of my "notes".
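For anyone who wants to reproduce this, here is a sketch of the sample data as a DataFrame (the column dtypes are assumptions; the values are taken from the table above):

```python
import pandas as pd

# Reconstruction of the sample data from the question,
# already sorted by Obj ID and then by Timestamp.
df = pd.DataFrame({
    'Obj ID':    [123, 123, 123, 126, 126, 126, 126, 126, 126, 126],
    'Timestamp': [506, 517, 519, 501, 505, 507, 508, 512, 515, 518],
    'x':         [123, 129, 125, 123, 153, 129, 143, 125, 153, 127],
    'y':         [456, 436, 454, 426, 453, 856, 496, 427, 453, 256],
    'link ID':   ['ab12', 'ab12', 'cd34', 'gk77', 'gk77',
                  'bj88', 'bj88', 'gk77', 'gk77', 'ab12'],
})
```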

Kev W.

2 Answers


Let's try this:

# Label each consecutive run of identical link IDs: the label increments
# every time the link ID differs from the previous row's link ID.
g = df['link ID'].ne(df['link ID'].shift()).cumsum().rename('group')

(df.groupby(['Obj ID', 'link ID', g])['Timestamp']
   .agg(StartTime='min', EndTime='max')
   .reset_index()
   .drop('group', axis=1))

Output:

   Obj ID link ID  StartTime  EndTime
0     123    ab12        506      517
1     123    cd34        519      519
2     126    ab12        518      518
3     126    bj88        507      508
4     126    gk77        501      505
5     126    gk77        512      515
Scott Boston
  • Hi, I edited the data slightly to include situations in which an object travels the same link twice. Will this still apply to that case? (I'm heading off to test it as well.) Thanks for the answer anyway. – Kev W. Jun 05 '17 at 17:01
  • Can we explain what the first line of code is doing? – Jayyu Sep 28 '21 at 15:17
  • @Jayyu Every time the link ID is different from the link ID on the previous line, the comparison returns True, and taking the cumsum of those Trues increments the group label by 1. So each consecutive run of equal link IDs gets its own group number. – Scott Boston Sep 28 '21 at 15:24
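The group-labelling trick from the first line can be illustrated with a minimal sketch, using the link ID column from the question's sample data:

```python
import pandas as pd

# link ID column from the question's sample data
s = pd.Series(['ab12', 'ab12', 'cd34', 'gk77', 'gk77',
               'bj88', 'bj88', 'gk77', 'gk77', 'ab12'])

# True whenever the link ID differs from the previous row
change = s.ne(s.shift())

# cumsum of the Trues labels each consecutive run of equal IDs
g = change.cumsum()
print(g.tolist())  # → [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]
```

Note how the two separate gk77 runs get distinct labels (3 and 5), which is what keeps repeated visits to the same link apart in the groupby.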
# Take the first and last row of each (Obj ID, link ID) pair and merge them
df.drop_duplicates(subset=['Obj ID', 'link ID'], keep='first').merge(
    df.drop_duplicates(subset=['Obj ID', 'link ID'], keep='last'),
    on=['Obj ID', 'link ID'])
A.Kot
  • Hi, I edited the data slightly to include situations in which an object travels the same link twice. Will this still apply to that case? (I'm heading off to test it as well.) Thanks for the answer anyway. – Kev W. Jun 05 '17 at 17:01
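As a quick check on the comment above, a minimal sketch using just the Obj ID 126 rows from the edited sample shows that the deduplicate-and-merge approach collapses both gk77 visits into a single record, so it does not handle the repeated-visit case:

```python
import pandas as pd

# Only the Obj ID 126 rows from the edited sample, which visit gk77 twice
df = pd.DataFrame({
    'Obj ID':    [126, 126, 126, 126, 126, 126, 126],
    'Timestamp': [501, 505, 507, 508, 512, 515, 518],
    'link ID':   ['gk77', 'gk77', 'bj88', 'bj88', 'gk77', 'gk77', 'ab12'],
})

first = df.drop_duplicates(subset=['Obj ID', 'link ID'], keep='first')
last = df.drop_duplicates(subset=['Obj ID', 'link ID'], keep='last')
merged = first.merge(last, on=['Obj ID', 'link ID'])

# Only one gk77 row survives, spanning 501-515 across both visits
print(merged[merged['link ID'] == 'gk77'])
```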