This question is based on the answers to this question: Join based on a concatenated 'origin-destination' variable and consecutive individual sites
I have a for loop that needs to iterate over 100+ thousand rows, and it performs too slowly. The main task is to iterate over each row of `pairs`, find the rows in the `STATIONS` dataframe with the same `Vonat_ID`, and then use the `match` function to extract the stations in `STATIONS` that run from the `HONNAN` value to the `HOVA` value of the current row of `pairs`.
This is the `pairs` dataframe (167 thousand rows):
head(pairs)
# A tibble: 6 x 4
# Groups: Vonat_ID [2]
Vonat_ID HONNAN HOVA OD_pár
<chr> <chr> <chr> <chr>
1 20210216-40517 Hegyeshalom Hegyeshalom oh. Hegyeshalom -- Hegyeshalom oh.
2 20210216-40517 Biharkeresztes oh. Püspökladány Biharkeresztes oh. -- Püspökladány
3 20210216-40517 Püspökladány Hegyeshalom oh. Püspökladány -- Hegyeshalom oh.
4 20210223-40517 Püspökladány Ferencváros Püspökladány -- Ferencváros
5 20210223-40517 Biharkeresztes oh. Püspökladány Biharkeresztes oh. -- Püspökladány
6 20210223-40517 Püspökladány Hegyeshalom oh. Püspökladány -- Hegyeshalom oh.
This is the `STATIONS` dataframe (2.7 million rows):
head(STATIONS)
# A tibble: 6 x 2
# Groups: Vonat_ID [1]
Vonat_ID Állomás
<chr> <chr>
1 20210216-40517 Biharkeresztes oh.
2 20210216-40517 Biharkeresztes
3 20210216-40517 Mezőpeterd
4 20210216-40517 Berettyóújfalu
5 20210216-40517 Sáp
6 20210216-40517 Báránd
...
20210216-40517 Öttevény
55 20210216-40517 Lébény-Mosonszentmiklós
56 20210216-40517 Kimle-Károlyháza
57 20210216-40517 Mosonmagyaróvár
58 20210216-40517 Hegyeshalom
59 20210216-40517 Hegyeshalom oh.
60 20210223-40517 Biharkeresztes oh.
61 20210223-40517 Biharkeresztes
62 20210223-40517 Mezőpeterd
63 20210223-40517 Berettyóújfalu
64 20210223-40517 Sáp
65 20210223-40517 Báránd
...
88 20210223-40517 Öttevény
89 20210223-40517 Lébény-Mosonszentmiklós
90 20210223-40517 Kimle-Károlyháza
91 20210223-40517 Mosonmagyaróvár
92 20210223-40517 Hegyeshalom
93 20210223-40517 Hegyeshalom oh.
...
To speed up the for loop, I extracted each column of `pairs` into its own vector: `pairs_ID`, `pairs_HN`, `pairs_HV` and `pairs_OD`, respectively.
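For reference, these are plain column extractions, roughly like this:

```r
pairs_ID <- pairs$Vonat_ID  # train identifier
pairs_HN <- pairs$HONNAN    # origin station (FROM)
pairs_HV <- pairs$HOVA      # destination station (TO)
pairs_OD <- pairs$OD_pár    # concatenated origin-destination label
```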
The desired output looks like this (`stat` is every "Állomás" from `STATIONS` between the start and end point of the "OD_pár"; HONNAN means FROM, HOVA means TO):
OD stat
1 Hegyeshalom -- Hegyeshalom oh. Hegyeshalom #"OD_pár" from the first row of `pairs`
2 Hegyeshalom -- Hegyeshalom oh. Hegyeshalom oh.
3 Biharkeresztes oh. -- Püspökladány Biharkeresztes oh.#"OD_pár" from the second row of `pairs`
4 Biharkeresztes oh. -- Püspökladány Biharkeresztes
5 Biharkeresztes oh. -- Püspökladány Mezőpeterd
6 Biharkeresztes oh. -- Püspökladány Berettyóújfalu
The for loop I am using is the following. I need one huge dataframe at the end with all the matching ODs and their station names, hence appending to a list inside the loop and binding the rows afterwards. I also added an error catch in case any errors appear.
bridge_total <- list()
for (x in 1:nrow(pairs)) {
  tryCatch({
    OD    <- pairs_OD[x]
    # subset the current train's stations only once per iteration
    stops <- STATIONS[STATIONS$Vonat_ID == pairs_ID[x], ][[2]]
    stat  <- stops[match(pairs_HN[x], stops):match(pairs_HV[x], stops)]
    # build the result inside tryCatch so a failed iteration
    # does not reuse stale OD/stat values from the previous row
    bridge_total[[x]] <- data.frame(OD = OD, stat = stat)
  }, error = function(e) cat("ERROR:", conditionMessage(e), "\n"))
  print(x)
}
bridge_total <- do.call(rbind, bridge_total)
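One direction I have considered (a minimal sketch, assuming base R `split` and that the start/end stations are always found) is splitting `STATIONS` by `Vonat_ID` once up front, so each iteration does a fast named-list lookup instead of scanning all 2.7 million rows:

```r
# Split the station names by train ID once; the result is a named list,
# so stations_by_id[[id]] is a cheap lookup instead of a full-table scan.
stations_by_id <- split(STATIONS$Állomás, STATIONS$Vonat_ID)

bridge_total <- vector("list", nrow(pairs))
for (x in seq_len(nrow(pairs))) {
  stops <- stations_by_id[[pairs_ID[x]]]
  i <- match(pairs_HN[x], stops)  # position of the origin station
  j <- match(pairs_HV[x], stops)  # position of the destination station
  if (!is.na(i) && !is.na(j)) {
    bridge_total[[x]] <- data.frame(OD = pairs_OD[x], stat = stops[i:j])
  }
}
bridge_total <- do.call(rbind, bridge_total)
```

This still loops row by row, so I am not sure how much it buys beyond removing the repeated subsetting.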
Is there a way I could speed up the loop further? Right now, based on test runs, iterating over 100k+ rows takes 2-3 hours at minimum.