
I have a dataframe that looks like this:

   ID     time      city        transport
0  1      10:20:00  London      car
1  20     08:50:20  Berlin      air plane
2  44     21:10:00  Paris       train
3  32     10:24:00  Rome        car
4  56     08:53:10  Berlin      air plane
5  90     21:8:00   Paris       train
.
.
.
1009 446  10:21:24  London     car

I want to group these rows so that rows with the same value in 'city' and 'transport', and a time difference of at most +3 min or -3 min, get the same 'ID'.

I already tried pd.Grouper() like this, but it didn't work:

df['time'] = pd.to_datetime(df['time'])
df['ID'] = df.groupby([pd.Grouper(key= 'time',freq ='3min'),'city','transport'])['ID'].transform('first')

The output is the first dataframe I had, without any changes. One reason could be that pd.to_datetime adds a date to "time", and because my data is very large the dates differ and the groupby doesn't work. I couldn't figure out how to apply a time interval (+3 min or -3 min) in the groupby without a date being added to the 'time' column.
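
For illustration, here is a minimal check of that date issue (just a sketch, assuming pandas is imported as pd): pd.to_datetime attaches today's date to a bare hh:mm:ss string, whereas pd.to_timedelta keeps it as a pure offset from midnight with no date at all:

import pandas as pd

s = pd.Series(["10:20:00", "08:50:20"])

# to_datetime fills in today's date for a bare time string
print(pd.to_datetime(s))    # e.g. 2021-02-06 10:20:00, 2021-02-06 08:50:20

# to_timedelta keeps only the offset from midnight, no date attached
print(pd.to_timedelta(s))   # 0 days 10:20:00, 0 days 08:50:20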

What I'm expecting is this:

   ID     time      city        transport
0  1      10:20:00  London      car
1  20     08:50:20  Berlin      air plane
2  44     21:10:00  Paris       train
3  32     10:24:00  Rome        car
4  20     08:53:10  Berlin      air plane
5  44     21:8:00   Paris       train
.
.
.
1009 1  10:21:24  London     car

I have been struggling with this question for a while and would really appreciate any help. Thanks in advance.

Mr. T

2 Answers


Exploring pd.Grouper()

  1. I found it useful to insert a start time so that it is more obvious how the buckets are being generated.
  2. Your requirement of +/-3 min is most closely matched by a 6-min bucket. This mostly matches your requirement, but +/-3 min of what?
  3. I have added a step that just shows what has been grouped and which time bucket it fell into.

setup

import io

import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""   ID     time      city        transport
0  1      10:20:00  London      car
1  20     08:50:20  Berlin      air plane
2  44     21:10:00  Paris       train
3  32     10:24:00  Rome        car
4  56     08:53:10  Berlin      air plane
5  90     21:08:00  Paris       train
6  33     05:08:22  Paris       train"""), sep=r"\s\s+", engine="python")

# force in origin so grouper generates bucket every Xmins from midnight with no seconds...
df = pd.concat([pd.DataFrame({"time":[pd.Timedelta(0)],"dummy":[True]}), df]).assign(dummy=lambda dfa: dfa.dummy.fillna(False))
df = df.assign(td=pd.to_timedelta(df.time))

analysis

### DEBUGGER ### - see what's being grouped...
df.groupby([pd.Grouper(key="td", freq="6min"), "city", "transport"]).agg(lambda x: list(x) if len(x) > 0 else np.nan).dropna()
  • note that two of the time buckets group more than one ID
                                                       time                      dummy           ID
(Timedelta('0 days 05:06:00'), 'Paris', 'train')       ['05:08:22']              [False]         [33.0]
(Timedelta('0 days 08:48:00'), 'Berlin', 'air plane')  ['08:50:20', '08:53:10']  [False, False]  [20.0, 56.0]
(Timedelta('0 days 10:18:00'), 'London', 'car')        ['10:20:00']              [False]         [1.0]
(Timedelta('0 days 10:24:00'), 'Rome', 'car')          ['10:24:00']              [False]         [32.0]
(Timedelta('0 days 21:06:00'), 'Paris', 'train')       ['21:10:00', '21:08:00']  [False, False]  [44.0, 90.0]

solution

# finally, group on double the window (6min). NB this is not a true +/-3min;
# it links rows that land in the same bucket
(df.assign(ID=lambda dfa: dfa
           .groupby([pd.Grouper(key='td', freq='6min'), 'city', 'transport'])['ID']
           .transform('first'))
 # cleanup... NB this needs changing if the dummy row is not inserted
 .query("not dummy")
 .drop(columns=["td", "dummy"])
 .assign(ID=lambda dfa: dfa.ID.astype(int))
)
time      ID  city    transport
10:20:00   1  London  car
08:50:20  20  Berlin  air plane
21:10:00  44  Paris   train
10:24:00  32  Rome    car
08:53:10  20  Berlin  air plane
21:08:00  44  Paris   train
05:08:22  33  Paris   train
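
As an aside, a possible variant (a sketch only, not part of this answer, assuming the df/td columns built in the setup above): flooring the timedeltas (Series.dt.floor) is anchored at midnight by definition, so the dummy origin row and the cleanup could likely be skipped:

# sketch: floor each timedelta onto a 6-min boundary counted from midnight,
# then group on the bucket instead of pd.Grouper - no dummy row needed
df2 = df.query("not dummy").copy()                   # drop the origin row, keep the rest
df2["bucket"] = df2["td"].dt.floor("6min")           # e.g. 08:50:20 -> 08:48:00
df2["ID"] = df2.groupby(["bucket", "city", "transport"])["ID"].transform("first").astype(int)
print(df2.drop(columns=["td", "dummy", "bucket"]))
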
Rob Raymond
  • thanks for your help, but it doesn't work. The first error is `expected hh:mm:ss format`, but I fixed that by adding `+':00'` at the end inside `to_timedelta`. The problem is that I still end up with the same dataframe, without any changes. –  Feb 06 '21 at 21:57
  • In your solution, at the end, 'ID' 44 and 90 should be the same as well. –  Feb 06 '21 at 22:01
  • ok, I worked around `21:8:00` by manually modifying it to `21:08:00`. Do you have a sample data set that is not working for you? The data in the question worked with that one mod. – Rob Raymond Feb 06 '21 at 22:02
  • I'll take a further look tomorrow - something odd is going on... increasing the frequency means more rows get mapped, but in unexpected ways. – Rob Raymond Feb 06 '21 at 22:18
  • That would be very nice of you. Since my data is very big (around 13M rows), and because of some other issues I can't post it here, I'm using small samples like the one in this question to find out whether the code works. –  Feb 06 '21 at 23:15
  • I've updated - getting more insight was all about getting transparency through the analysis step. – Rob Raymond Feb 07 '21 at 18:21
  • What I mean by +/-3 min is that rows which differ by at most 3 minutes should have the same value in ID. For example, if one row has the value **10:30**, another has **10:33** and a third has **10:27**, they should all have the same ID (a sketch of this chained reading is at the end of the page). –  Feb 08 '21 at 08:50
  • understood, but what's the algorithm to center on 10:30? Why not choose 10:27 or 10:33 as the center? Your definition reduces to being recursive. – Rob Raymond Feb 08 '21 at 09:24
import numpy as np
from datetime import datetime

def convert(seconds):
    # split a number of seconds into (hours, minutes, seconds)
    seconds = seconds % (24 * 3600)
    hour = seconds // 3600
    seconds %= 3600
    minutes = seconds // 60
    seconds %= 60
    return hour, minutes, seconds

def get_sec(h, m, s):
    """Get Seconds from time."""
    # np.empty (the function object) is used as a "not set yet" sentinel below
    if h == np.empty:
        h = 0
    if m == np.empty:
        m = 0
    if s == np.empty:
        s = 0
    return int(h) * 3600 + int(m) * 60 + int(s)

# parse the time strings once, then sort so that close times become neighbours
# (note: this answer uses a lowercase 'id' column, see the output below)
df['time'] = df['time'].apply(lambda x: datetime.strptime(x, '%H:%M:%S') if isinstance(x, str) else x)
df = df.sort_values(by=["time"])
print(df)

prev_hour = np.empty
prev_minute = np.empty
prev_second = np.empty
for key, item in df.iterrows():
    curr_hour = item.time.hour
    curr_minute = item.time.minute
    curr_second = item.time.second
    curr_id = item.id
    curr_seconds = get_sec(curr_hour, curr_minute, curr_second)
    prev_seconds = get_sec(prev_hour, prev_minute, prev_second)
    diff_seconds = curr_seconds - prev_seconds
    hour, minute, second = convert(diff_seconds)
    # if the previous row (by time order) is at most ~3 minutes earlier, reuse its id
    # NB: only adjacent rows are compared here; city/transport are not checked
    if (hour == 0) & (minute <= 3):
        df.loc[key, 'id'] = prev_id
    prev_hour = item.time.hour
    prev_minute = item.time.minute
    prev_second = item.time.second
    prev_id = item.id

print(df)


output:
   id                time    city  transport
1  20 1900-01-01 08:50:20  Berlin  air plane
4  20 1900-01-01 08:53:10  Berlin  air plane
0   1 1900-01-01 10:20:00  London        car
3  32 1900-01-01 10:24:00    Rome        car
5  90 1900-01-01 21:08:00   Paris      train
2  90 1900-01-01 21:10:00   Paris      train
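
For illustration (not part of this answer), a quick check of how the helper functions above behave, including a borderline gap just over 3 minutes:

print(get_sec(0, 3, 0))   # 180 seconds
print(convert(170))       # (0, 2, 50) -> passes the (hour==0) & (minute<=3) check
print(convert(200))       # (0, 3, 20) -> also passes, although the gap is 3 min 20 s
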
Golden Lion
  • see https://stackoverflow.com/questions/6402812/how-to-convert-an-hmmss-time-string-to-seconds-in-python/6402934 – Golden Lion Feb 07 '21 at 14:06
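
Finally, pulling the comment discussion under the first answer together: below is a sketch (not taken from either answer) of the "chained" reading the comments converge on - rows with the same city/transport whose consecutive time gaps are at most 3 minutes share the ID of the first such row in the original order. It starts again from the question's original dataframe df, assumes zero-padded hh:mm:ss strings (e.g. 21:08:00, as discussed in the comments), and uses the column names from the question:

import pandas as pd

out = df.copy()
out["td"] = pd.to_timedelta(out["time"])            # time of day only, no date attached

# sort so that rows of one city/transport sit together in time order
out = out.sort_values(["city", "transport", "td"])

# a new chain starts whenever the gap to the previous row of the same
# city/transport exceeds 3 minutes (the first row of each group also starts one)
gap = out.groupby(["city", "transport"])["td"].diff()
out["chain"] = (gap.isna() | (gap > pd.Timedelta("3min"))).cumsum()

# back to the original row order, then give every chain the ID of its
# first-appearing row (matches the expected output in the question)
out = out.sort_index()
out["ID"] = out.groupby("chain")["ID"].transform("first")
out = out.drop(columns=["td", "chain"])
print(out)

Note that this reading is transitive: 10:27, 10:30 and 10:33 all end up in one chain because each consecutive gap is within 3 minutes, even though the outer two are 6 minutes apart - which is exactly the recursion Rob Raymond's last comment points at.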