[Python][Pandas] Efficiently crossing and comparing the values in columns of two groupby

Question

I am currently trying to cross the data of two dataframes (df and marche_type_jointure) about bus journeys grouped by certain columns (in english: stop, direction, type of day, season, year) as the code here shows:

df_grouped = df.groupby(['arret', 'sens', 'type_jour','saison', 'annee'])
marche_type_jointure_grouped = marche_type_jointure.groupby(['arret', 'sens', 'type_jour','saison', 'annee'])

Which gives me the following dataframes with the following relevant columns (see screenshots df_variables_1, df_variables_2, marche_type_variables) and you can use the following dataframes as examples (only the most important variables as you can see):

df_jointure = pd.DataFrame([
['Aller','VIG01A' , time(7,21,28),'vac_s' ,'hiver' ,1978 , 'NaN' ], 
['Aller','VIG01A' , time(18,7,35),'vac_s' ,'hiver' ,1978 , 'NaN' ], 
['Aller','LYC30R' , time(7,21,54),'vac_s' ,'hiver' ,1978 , 'NaN' ], 
['Aller','LYC30R' , time(14,43,42),'vac_s' ,'hiver' ,1978 , 'NaN' ] ], columns=['sens', 'arret', 'heure_arrivee_reelle', 'type_jour', 'saison', 'annee', 'temps_trajet_mt'])


marche_type_jointure = df = pd.DataFrame([
['VIG01A', time(6,0,0), time(6,29,0), 'hiver', 'Aller', 4, 'vac_s', 1978],
['VIG01A', time(7,0,0), time(7,29,0), 'hiver', 'Aller', 4, 'vac_s', 1978], 
['VIG01A', time(7,0,0), time(7,59,0), 'hiver', 'Aller', 4, 'vac_s', 1978], 
['VIG01A', time(18,30,0), time(18,59,0), 'hiver', 'Aller', 4, 'vac_s', 1978], 
['LYC30R', time(6,0,0), time(6,29,0), 'hiver', 'Aller', 1, 'semaine', 1978], 
['LYC30R', time(7,45,0), time(7,59,0), 'hiver', 'Aller', 1, 'semaine', 1978],
['LYC30R', time(9,30,0), time(15,59,00), 'hiver', 'Aller', 1, 'semaine', 1978]], 
columns=['arret', 'heure_debut_periode', 'heure_fin_periode', 'saison', 'sens', 'temps_trajet_sur_periode', 'type_jour', 'annee])

Here are also the most important variables: in df : 'sens' (column 6), 'arret' (stop, column 9), 'heure_arrivee_reelle' (actual time of arrival (of the bus), column 13), 'type_jour' (type of day, with 'vac_s' which means holidays, 'samedi' which is saturday, 'dimanche' which is sunday, and eventually 'semaine' which means week, column 25), 'saison' (which is season: 'hiver' for winter or 'ete' for summer, column 26), 'annee' (which is year, either 2018 or 1978 that refers winter 2017-2018, i.e. from January 2017 to May 2017 then September 2017 to May 2018, column 29), 'temps_trajet_mt' which is empty for the moment but the values will come from 'marche_type_jointure_grouped'(.temps_trajet_sur_periode).

For marche_type_jointure_grouped: 'arret' (stop), 'heure_debut_periode' and 'heure_fin_periode' which are the beginning and ending of a certain period of time in which there is a certain 'temps_trajet_sur_periode' which is the time necessary to go from the previous stop to the one we're at (in the bus line). 'type_jour', 'saison' and 'sens' are the same as in df_grouped.

What I'm actually looking to do is, for each 'key' of the grouped dataframes(stop, direction, type of day, season and year), compare the value 'heure_arrivee_reelle' ('actual_time_of_arrival' in english) in df_grouped with the different values of 'heure_fin_periode' ('end_time_period' in english) and/or 'heure_debut_periode'('starting_time_period' following this condition : 'if heure_arrivee_reelle in [heure_debut_periode, heure_fin_periode]' (as an interval, but there might be an easier condition giving the same result) then the value of temps_trajet_mt in df_grouped becomes the corresponding value of temps_trajet_sur_periode in marche_type_jointure_grouped. Obviously I'm looking to do it in an efficient way. Here's what I tried to do so that maybe you understand a little bit more what I'm trying to do:

for key,group in df_grouped:
df_mt = marche_type_jointure_grouped.get_group(key) #we take the corresponding group from the other dataframe
index_har = list(group.heure_arrivee_reelle.index.values) #list of index of actual arrival times from df_grouped referring to key
index_hdp = list(df_mt.heure_debut_periode.index.values) #list of index of beginning time of period from marche_type_jointure_grouped refering to key
for i in index_hdp:
    for j in index_har:
        if df_mt.heure_fin_periode[i] >= group.heure_arrivee_reelle[j]:
            group.temps_trajet_mt[j] = df_mt.temps_trajet_sur_periode[i]
            index_har.remove(j) #so that I do not have to compare it again
        else:
            pass
df_grouped.size().unstack() #so that I can see the result

Thank you in advance for your help, it really means a lot I've been on this for more than a week now ! And please tell me if what I'm trying to do isn't clear (it's more or less a conditional joint, if you will...)

Welcome to SO. Please provide a **[mcve]**. For Pandas-specific advice, see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). — jpp, Aug 22 '18 at 08:17
I actually solved the problem: I used a merge and then two filters on my data after computing the time periods (instead of doing the comparison, I think it is more efficient). Thanks anyway :-) — tazih3, Aug 24 '18 at 08:13
Sure, in that case you can post your own answer. This can help the wider community. — jpp, Aug 24 '18 at 08:26

[Python][Pandas] Efficiently crossing and comparing the values in columns of two groupby

0 Answers0