
I have a dataset for one bus line, recorded every day, with 32 buses and two values of route_direction (0, 1). In the first direction there are 18 stations, each with a station_seq from 1 to 18; the other direction has 15 stations with station_seq 1-15. The time the bus enters/exits each station is recorded, so each record contains bus_id, route_direction, station_seq, in_time, out_time, and station_id.

|   |   route_id |   route_direction |     bus_id |   station_seq | schdeule_date   | in_time   | out_time   |
|--:|-----------:|------------------:|-----------:|--------------:|:----------------|:----------|:-----------|
| 0 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:31  | 05:04:58   |
| 1 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:27  | 05:04:58   |
| 2 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:31  | 05:06:31   |
| 3 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:27  | 05:06:31   |
| 4 |         59 |                 1 | 1349508393 |             1 | 2021-01-01      | 05:00:35  | 05:00:56   |

First I tried grouping by some columns to give an index to each trip:

grouped = df.groupby(['bus_id', 'route_direction'])

which gives something like this:

| index |   route_id |   route_direction |     bus_id |   station_seq | schdeule_date   | in_time   | out_time   |
|------:|-----------:|------------------:|-----------:|--------------:|:----------------|:----------|:-----------|
|   654 |         59 |                 0 | 1349508329 |             1 | 2021-01-01      | NaN       | 06:34:10   |
|   663 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:34  | 06:34:04   |
|   664 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:33  | 06:34:04   |
|   677 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:34  | 06:35:34   |
|   678 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:33  | 06:35:34   |
|   ... |        ... |               ... |        ... |           ... | ...             | ...       | ...        |
| 12133 |         59 |                 0 | 1349508329 |            12 | 2021-01-01      | NaN       | NaN        |

As you can see, there are also duplicate enter/exit records at the same station for the same bus_id at almost the same date and time. I tried drop_duplicates, but it didn't work well:

df = df.drop_duplicates(subset=['bus_id', 'route_direction', 'station_seq', 'station_id', 'in_time'], keep='first').reset_index(drop=True)

There are also some NaN values in in_time or out_time, so if I dropna I may lose the records for some of the stations along the bus line.

How can I group each bus's records into one trip and give it an id, and how can I drop the duplicated records in this case (small differences in entering time)? Any help will be appreciated.
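(Editor's note: exact-match `drop_duplicates` misses rows whose `in_time` differs by a second or two. A minimal, hedged sketch of one workaround is to bucket `in_time` to the minute before deduplicating; the sample rows below are made up to mirror the ones above, and the one-minute bucket is an assumption to be tuned.)

```python
import pandas as pd

df = pd.DataFrame({
    'bus_id': [1349508393] * 4,
    'route_direction': [1] * 4,
    'station_seq': [2] * 4,
    'schdeule_date': ['2021-01-01'] * 4,
    'in_time': ['05:04:31', '05:04:27', '05:04:31', '05:04:27'],
    'out_time': ['05:04:58', '05:04:58', '05:06:31', '05:06:31'],
})

# bucket in_time to the minute so rows a second or two apart collide
df['in_time_t'] = pd.to_datetime(df['schdeule_date'] + ' ' + df['in_time'])
df['in_minute'] = df['in_time_t'].dt.floor('min')

# keep the earliest record per bucket, then drop the helper column
df = (df.sort_values('in_time_t')
        .drop_duplicates(subset=['bus_id', 'route_direction', 'station_seq', 'in_minute'],
                         keep='first')
        .drop(columns=['in_minute'])
        .reset_index(drop=True))
```

Note that `keep='first'` discards the later out_time of a near-duplicate pair; aggregating min(in_time)/max(out_time) per bucket, as the answer below does, may be preferable.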

khaled

1 Answer

  1. sort_values by 'bus_id' and 'in_time'
  2. groupby 'bus_id'; for every bus_id, calculate the time difference between each record and its previous record
  3. if the time difference is less than 60 s, tag with 0, else tag with 1, so that gaps smaller than 60 s are ignored
  4. cumsum the tag to create a group tag
  5. groupby the group tag; for every group keep min(in_time) and max(out_time)
import numpy as np
import pandas as pd

# convert in_time to a datetime first, then sort the values
df['in_time_t'] = pd.to_datetime(df['schdeule_date'] + ' ' + df['in_time'])
df.sort_values(['bus_id', 'in_time_t'], inplace=True)

# calculate the time difference between consecutive records for every bus_id
df['t_diff'] = df.groupby('bus_id')['in_time_t'].diff()

# set the group tag: 1 starts a new group, 0 continues the previous one
cond = df['t_diff'].dt.total_seconds() < 60
df['tag'] = np.where(cond, 0, 1).cumsum()

# for every group tag keep min(in_time) and max(out_time)
df_result = df.groupby(['route_id', 'route_direction', 'bus_id', 'station_seq', 'schdeule_date',
       'tag']).agg({'in_time': 'min', 'out_time': 'max'}).reset_index()
df

|       |   route_id |   route_direction |     bus_id |   station_seq | schdeule_date   | in_time   | out_time   |
|------:|-----------:|------------------:|-----------:|--------------:|:----------------|:----------|:-----------|
|     0 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:31  | 05:04:58   |
|     1 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:27  | 05:04:58   |
|     2 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:31  | 05:06:31   |
|     3 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:27  | 05:06:31   |
|     4 |         59 |                 1 | 1349508393 |             1 | 2021-01-01      | 05:00:35  | 05:00:56   |
|   654 |         59 |                 0 | 1349508329 |             1 | 2021-01-01      | NaN       | 06:34:10   |
|   663 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:34  | 06:34:04   |
|   664 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:33  | 06:34:04   |
|   677 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:34  | 06:35:34   |
|   678 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:33  | 06:35:34   |
| 12133 |         59 |                 0 | 1349508329 |            12 | 2021-01-01      | NaN       | NaN        |

df_result

|   |   route_id |   route_direction |     bus_id |   station_seq | schdeule_date   |   tag | in_time   | out_time   |
|--:|-----------:|------------------:|-----------:|--------------:|:----------------|------:|:----------|:-----------|
| 0 |         59 |                 0 | 1349508329 |             1 | 2021-01-01      |     2 | NaN       | 06:34:10   |
| 1 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      |     1 | 06:33:33  | 06:35:34   |
| 2 |         59 |                 0 | 1349508329 |            12 | 2021-01-01      |     3 | NaN       | NaN        |
| 3 |         59 |                 1 | 1349508393 |             1 | 2021-01-01      |     4 | 05:00:35  | 05:00:56   |
| 4 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      |     5 | 05:04:27  | 05:06:31   |

df with tag
|       |   route_id |   route_direction |     bus_id |   station_seq | schdeule_date   | in_time   | out_time   | in_time_t           | t_diff          |   tag |
|------:|-----------:|------------------:|-----------:|--------------:|:----------------|:----------|:-----------|:--------------------|:----------------|------:|
|   664 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:33  | 06:34:04   | 2021-01-01 06:33:33 | NaT             |     1 |
|   678 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:33  | 06:35:34   | 2021-01-01 06:33:33 | 0 days 00:00:00 |     1 |
|   663 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:34  | 06:34:04   | 2021-01-01 06:33:34 | 0 days 00:00:01 |     1 |
|   677 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:34  | 06:35:34   | 2021-01-01 06:33:34 | 0 days 00:00:00 |     1 |
|   654 |         59 |                 0 | 1349508329 |             1 | 2021-01-01      | nan       | 06:34:10   | NaT                 | NaT             |     2 |
| 12133 |         59 |                 0 | 1349508329 |            12 | 2021-01-01      | nan       | nan        | NaT                 | NaT             |     3 |
|     4 |         59 |                 1 | 1349508393 |             1 | 2021-01-01      | 05:00:35  | 05:00:56   | 2021-01-01 05:00:35 | NaT             |     4 |
|     1 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:27  | 05:04:58   | 2021-01-01 05:04:27 | 0 days 00:03:52 |     5 |
|     3 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:27  | 05:06:31   | 2021-01-01 05:04:27 | 0 days 00:00:00 |     5 |
|     0 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:31  | 05:04:58   | 2021-01-01 05:04:31 | 0 days 00:00:04 |     5 |
|     2 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:31  | 05:06:31   | 2021-01-01 05:04:31 | 0 days 00:00:00 |     5 |
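The `tag` column above already separates consecutive station events; to assign one id per *trip* rather than per station group, the same gap-and-cumsum idea can be applied with a larger threshold. A minimal sketch, assuming a trip break is any gap over 30 minutes (a made-up threshold; tune it to the real headway) and using made-up sample timestamps:

```python
import pandas as pd

df = pd.DataFrame({
    'bus_id': [1349508329, 1349508329, 1349508329, 1349508393, 1349508393],
    'in_time_t': pd.to_datetime([
        '2021-01-01 06:33:33', '2021-01-01 06:33:34', '2021-01-01 09:10:00',
        '2021-01-01 05:00:35', '2021-01-01 05:04:27']),
})

df = df.sort_values(['bus_id', 'in_time_t']).reset_index(drop=True)
gap = df.groupby('bus_id')['in_time_t'].diff()

# a new trip starts at each bus's first record (gap is NaT)
# or after a gap larger than the assumed 30-minute threshold
new_trip = gap.isna() | (gap > pd.Timedelta(minutes=30))
df['trip_id'] = new_trip.cumsum()
```

Here each bus gets fresh trip ids automatically, because its first record always has a NaT gap and therefore starts a new trip.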

Ferris
  • thank you for trying to help. I applied your suggested solution, but it's not sufficient, because in some cases the time when the bus enters the next station is earlier than the time it exits the current station. BTW, I also have a big problem with NaN values: if I dropna those rows, I may lose the records for many stations along the trip. Any suggestion for dealing with those NaN values? Really appreciate your help. – khaled Feb 02 '21 at 08:32
  • maybe you could provide more sample data to cover all the situations! – Ferris Feb 02 '21 at 08:36
  • How can I send you a sample of data? – khaled Feb 02 '21 at 08:38
  • you can update your question with more data, or share it using Google Drive. – Ferris Feb 02 '21 at 08:47
  • Here is a sample of data https://drive.google.com/file/d/1nxTL6lmXwe5cVVCCgWZ-89jJAEuAmol-/view?usp=sharing – khaled Feb 02 '21 at 09:01
  • In your sample file, you have 409 rows where `cond = df['in_time'] > df['out_time']`; it depends on how you want to handle them. – Ferris Feb 02 '21 at 09:03
  • I just checked it; there are 64 rows where df['in_time'] > df['out_time'] – khaled Feb 02 '21 at 09:08
  • So in your opinion, what is the best way to deal with such a case? Is it reasonable to fill a NaN value with df['out_time'] - df['dwell'].mean(), where dwell is the difference between exit and enter times at the station? – khaled Feb 02 '21 at 09:13
  • 1. for NaN values, fill with "0:00"; 2. for 'out_time' < 'in_time', swap the times – Ferris Feb 02 '21 at 09:18
  • if only one of `in_time` or `out_time` is null, fill it with the non-null one; if both are null, fill with `0:00` – Ferris Feb 02 '21 at 09:20
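The comment thread above sketches a clean-up recipe; a hedged, runnable sketch of those suggestions follows (the four sample rows are made up). The order matters: filling a single missing side first makes that pair equal, so the later swap never touches it.

```python
import pandas as pd

df = pd.DataFrame({
    'in_time':  ['06:35:34', None, '06:33:34', None],
    'out_time': ['06:33:34', '06:34:10', None, None],
})

# 1. if only one side is missing, copy the other side over
df['in_time'] = df['in_time'].fillna(df['out_time'])
df['out_time'] = df['out_time'].fillna(df['in_time'])

# 2. if both sides are missing, fall back to "0:00"
df[['in_time', 'out_time']] = df[['in_time', 'out_time']].fillna('0:00')

# 3. swap pairs recorded in the wrong order (string comparison is safe
#    here because real times are zero-padded HH:MM:SS, and the rows
#    filled above are equal pairs, so they are left untouched)
swap = df['in_time'] > df['out_time']
df.loc[swap, ['in_time', 'out_time']] = df.loc[swap, ['out_time', 'in_time']].values
```

Whether "0:00" is a sensible sentinel, or the dwell-time imputation khaled proposed is better, depends on how the cleaned data will be used downstream.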