I am currently trying to cross the data of two dataframes (df and marche_type_jointure) about bus journeys grouped by certain columns (in english: stop, direction, type of day, season, year) as the code here shows:
df_grouped = df.groupby(['arret', 'sens', 'type_jour','saison', 'annee'])
marche_type_jointure_grouped = marche_type_jointure.groupby(['arret', 'sens', 'type_jour','saison', 'annee'])
Which gives me the following dataframes with the following relevant columns (see screenshots df_variables_1, df_variables_2, marche_type_variables) and you can use the following dataframes as examples (only the most important variables as you can see):
df_jointure = pd.DataFrame([
['Aller','VIG01A' , time(7,21,28),'vac_s' ,'hiver' ,1978 , 'NaN' ],
['Aller','VIG01A' , time(18,7,35),'vac_s' ,'hiver' ,1978 , 'NaN' ],
['Aller','LYC30R' , time(7,21,54),'vac_s' ,'hiver' ,1978 , 'NaN' ],
['Aller','LYC30R' , time(14,43,42),'vac_s' ,'hiver' ,1978 , 'NaN' ] ], columns=['sens', 'arret', 'heure_arrivee_reelle', 'type_jour', 'saison', 'annee', 'temps_trajet_mt'])
marche_type_jointure = df = pd.DataFrame([
['VIG01A', time(6,0,0), time(6,29,0), 'hiver', 'Aller', 4, 'vac_s', 1978],
['VIG01A', time(7,0,0), time(7,29,0), 'hiver', 'Aller', 4, 'vac_s', 1978],
['VIG01A', time(7,0,0), time(7,59,0), 'hiver', 'Aller', 4, 'vac_s', 1978],
['VIG01A', time(18,30,0), time(18,59,0), 'hiver', 'Aller', 4, 'vac_s', 1978],
['LYC30R', time(6,0,0), time(6,29,0), 'hiver', 'Aller', 1, 'semaine', 1978],
['LYC30R', time(7,45,0), time(7,59,0), 'hiver', 'Aller', 1, 'semaine', 1978],
['LYC30R', time(9,30,0), time(15,59,00), 'hiver', 'Aller', 1, 'semaine', 1978]],
columns=['arret', 'heure_debut_periode', 'heure_fin_periode', 'saison', 'sens', 'temps_trajet_sur_periode', 'type_jour', 'annee])
Here are also the most important variables:
in df : 'sens'
(column 6),
'arret'
(stop, column 9),
'heure_arrivee_reelle'
(actual time of arrival (of the bus), column 13),
'type_jour'
(type of day, with 'vac_s' which means holidays, 'samedi' which is saturday, 'dimanche' which is sunday, and eventually 'semaine' which means week, column 25),
'saison'
(which is season: 'hiver' for winter or 'ete' for summer, column 26),
'annee'
(which is year, either 2018 or 1978 that refers winter 2017-2018, i.e. from January 2017 to May 2017 then September 2017 to May 2018, column 29),
'temps_trajet_mt' which is empty for the moment but the values will come from 'marche_type_jointure_grouped'(.temps_trajet_sur_periode).
For marche_type_jointure_grouped: 'arret'
(stop), 'heure_debut_periode'
and 'heure_fin_periode'
which are the beginning and ending of a certain period of time in which there is a certain 'temps_trajet_sur_periode'
which is the time necessary to go from the previous stop to the one we're at (in the bus line). 'type_jour'
, 'saison'
and 'sens'
are the same as in df_grouped
.
What I'm actually looking to do is, for each 'key' of the grouped dataframes(stop, direction, type of day, season and year), compare the value 'heure_arrivee_reelle'
('actual_time_of_arrival' in english) in df_grouped
with the different values of 'heure_fin_periode'
('end_time_period' in english) and/or 'heure_debut_periode'
('starting_time_period' following this condition :
'if heure_arrivee_reelle in [heure_debut_periode, heure_fin_periode]' (as an interval, but there might be an easier condition giving the same result) then the value of temps_trajet_mt in df_grouped becomes the corresponding value of temps_trajet_sur_periode in marche_type_jointure_grouped
.
Obviously I'm looking to do it in an efficient way. Here's what I tried to do so that maybe you understand a little bit more what I'm trying to do:
for key,group in df_grouped:
df_mt = marche_type_jointure_grouped.get_group(key) #we take the corresponding group from the other dataframe
index_har = list(group.heure_arrivee_reelle.index.values) #list of index of actual arrival times from df_grouped referring to key
index_hdp = list(df_mt.heure_debut_periode.index.values) #list of index of beginning time of period from marche_type_jointure_grouped refering to key
for i in index_hdp:
for j in index_har:
if df_mt.heure_fin_periode[i] >= group.heure_arrivee_reelle[j]:
group.temps_trajet_mt[j] = df_mt.temps_trajet_sur_periode[i]
index_har.remove(j) #so that I do not have to compare it again
else:
pass
df_grouped.size().unstack() #so that I can see the result
Thank you in advance for your help, it really means a lot I've been on this for more than a week now ! And please tell me if what I'm trying to do isn't clear (it's more or less a conditional joint, if you will...)