1

I am working through a problem from the VAST 2015 challenge (http://vacommunity.org/VAST+Challenge+2015).

I have a large df with data points tracking (fictitious) individuals as they move around a theme park. The individuals each have a unique ID, and the event type is either 'check-in' or 'movement'. There are circa 500-1000 data points for each user ID (ie. multiple check-ins and multiple movements recorded for each ID). The df has the following columns;

 Timestamp    ID    Type X-coordinate   Y-Coordinate

0 2014-6-07 08:00:08    1102394  check-in  99 77

...

21 2014-6-07 08:00:15    1102394  movement  99 76  etc

There are 6411 unique IDs in the dataframe. However, from creating two new dataframes;

df_checkin = df.loc[df['type'] == "check-in"]
df_movement = df.loc[df['type'] == "movement"]

and performing ID value_counts across these, it shows that whilst 6411 of the IDs record a 'movement', only 6410 record a 'check-in'.

How can I compare these value counts (or go back to the original df), to identify which of the 6411 IDs present in the dataset recorded movement but not a check-in?

Thanks!

Edited to include more data;

Movement df

Timestamp   id  type    X   Y
21  2014-6-07 08:00:27  657863  movement    7   43
27  2014-6-07 08:00:35  657863  movement    8   43
28  2014-6-07 08:00:36  1187304 movement    63  98
32  2014-6-07 08:00:39  1187304 movement    63  97
33  2014-6-07 08:00:42  1187304 movement    64  96
... ... ... ... ... ...
9078618 2014-6-07 23:35:24  694782  movement    63  99
9078619 2014-6-07 23:35:24  800018  movement    63  99
9078620 2014-6-07 23:35:24  9591    movement    63  99
9078621 2014-6-07 23:35:24  963248  movement    63  99
9078622 2014-6-07 23:35:24  972171  movement    63  99

Check-in df

Timestamp   id  type    X   Y
0   2014-6-07 08:00:08  1102394 check-in    99  77
1   2014-6-07 08:00:08  1187304 check-in    63  99
2   2014-6-07 08:00:08  1363700 check-in    99  77
3   2014-6-07 08:00:10  1449032 check-in    63  99
4   2014-6-07 08:00:10  279658  check-in    63  99
... ... ... ... ... ...
9044493 2014-6-07 23:04:36  90727   check-in    42  37
9044495 2014-6-07 23:04:37  878118  check-in    42  37
9044496 2014-6-07 23:04:38  2044279 check-in    60  37
9044497 2014-6-07 23:04:39  1933890 check-in    60  37
9044498 2014-6-07 23:04:39  1962745 check-in    60  37

Movement Value Counts

999107     2549
1723967    2528
685884     2523
697057     2518
1928560    2382
           ... 
1523741     343
1306211     326
1187304     309
590879      248
657863       73
Name: id, Length: 6411, dtype: int64

Check-in value counts

999107     43
1349992    41
204640     40
1928560    40
1285841    40
           ..
953838      1
1458915     1
1748887     1
1680161     1
417205      1
Name: id, Length: 6410, dtype: int64
Coralie E
  • 11
  • 2
  • Hi there, welcome to SO, as your data is ficitious can you post the top 5 rows as well as your target output? from the sounds of it you can use `isin` i.e `df[df['ID'].isin(df_checkin['ID'])` but it's hard to say without a [mcve] (external links aren't the norm here, usually you need to provide either a copy&paste example or the code to reproduce your dataframe) – Umar.H Mar 30 '20 at 16:19
  • Thank you - added further details! – Coralie E Mar 30 '20 at 16:58
  • This would help https://stackoverflow.com/questions/20225110/comparing-two-dataframes-and-getting-the-differences – Sridhar Cr Mar 30 '20 at 16:58
  • Thank you all - the concatenation solution worked. – Coralie E Mar 30 '20 at 17:31
  • nice one, you can mark this post as a duplicate based on that answer which will help others who may find your post in future. – Umar.H Mar 30 '20 at 18:50

0 Answers0