0

This question is based on answers of question here: Join based on a concatenated 'origin-destination' variable and consecutive individual sites

I have a for loop that needs to iterate over +100 thousand rows, and performs too slow. The main task is to iterate over each row of pairs and find the Vonat_ID variable in STATIONS dataframe, then with the match function find the stations in STATIONS that go from HONNAN variable to HOVA variable from the iterating row in pairs.

This is the pairs dataframe. (167 thousand rows)

 head(pairs)
# A tibble: 6 x 4
# Groups:   Vonat_ID [2]
  Vonat_ID       HONNAN             HOVA            OD_pár                            
  <chr>          <chr>              <chr>           <chr>                             
1 20210216-40517 Hegyeshalom        Hegyeshalom oh. Hegyeshalom -- Hegyeshalom oh.    
2 20210216-40517 Biharkeresztes oh. Püspökladány    Biharkeresztes oh. -- Püspökladány
3 20210216-40517 Püspökladány       Hegyeshalom oh. Püspökladány -- Hegyeshalom oh.   
4 20210223-40517 Püspökladány       Ferencváros     Püspökladány -- Ferencváros       
5 20210223-40517 Biharkeresztes oh. Püspökladány    Biharkeresztes oh. -- Püspökladány
6 20210223-40517 Püspökladány       Hegyeshalom oh. Püspökladány -- Hegyeshalom oh.   

This is the STATIONS dataframe. (2.7 millions rows)

head(STATIONS)
# A tibble: 6 x 2
# Groups:   Vonat_ID [1]
  Vonat_ID       Állomás           
  <chr>          <chr>             
1 20210216-40517 Biharkeresztes oh.
2 20210216-40517 Biharkeresztes    
3 20210216-40517 Mezőpeterd        
4 20210216-40517 Berettyóújfalu    
5 20210216-40517 Sáp               
6 20210216-40517 Báránd            
...
   20210216-40517 Öttevény                
55 20210216-40517 Lébény-Mosonszentmiklós 
56 20210216-40517 Kimle-Károlyháza        
57 20210216-40517 Mosonmagyaróvár         
58 20210216-40517 Hegyeshalom             
59 20210216-40517 Hegyeshalom oh.
60 20210223-40517 Biharkeresztes oh. 
61 20210223-40517 Biharkeresztes     
62 20210223-40517 Mezőpeterd         
63 20210223-40517 Berettyóújfalu     
64 20210223-40517 Sáp                
65 20210223-40517 Báránd             
...
88 20210223-40517 Öttevény               
89 20210223-40517 Lébény-Mosonszentmiklós
90 20210223-40517 Kimle-Károlyháza       
91 20210223-40517 Mosonmagyaróvár        
92 20210223-40517 Hegyeshalom            
93 20210223-40517 Hegyeshalom oh.
...


     

I created vectors of each variable from pairs to speed up the for loop. pairs_ID, pairs_HN, pairs_HV and pairs_OD, respectively.

The desired output looks like this: (stat is all the "Állomás" from STATIONS between the "OD_pár" starting and endpoint. (HONNAN means FROM, HOVA means TO))

                                  OD               stat
1     Hegyeshalom -- Hegyeshalom oh.       Hegyeshalom #"OD_pár" from the first row of `pairs`
2     Hegyeshalom -- Hegyeshalom oh.     Hegyeshalom oh.
3 Biharkeresztes oh. -- Püspökladány  Biharkeresztes oh.#"OD_pár" from the second row of `pairs`
4 Biharkeresztes oh. -- Püspökladány     Biharkeresztes
5 Biharkeresztes oh. -- Püspökladány         Mezőpeterd
6 Biharkeresztes oh. -- Püspökladány     Berettyóújfalu

The for loop I am using is the following. I need a huge dataframe at the end with all the matching ODs and their station names, hence the list appending at the end of the loop. Also, I added an error catch function in case any appear.

bridge_total <- list()
for (x in 1:nrow(pairs)) {
tryCatch({
    OD <- pairs_OD[x]
    stat <- STATIONS[STATIONS$Vonat_ID==pairs_ID[x],][[2]][match(pairs_HN[x], STATIONS[STATIONS$Vonat_ID==pairs_ID[x],][[2]]) :
                   match(pairs_HV[x], STATIONS[STATIONS$Vonat_ID==pairs_ID[x],][[2]])]
    }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
  df <- data.frame("OD"=OD, "stat"=stat)
  bridge_total[[x]] <- df
  print(x)
}
bridge_total <- do.call(rbind, bridge_total)

Is there a way I could speed up the loop even better? right now as of test runs, to iterate over +100k rows, it requires 2-3 hours minimum.

  • 4
    I don't fully understand the description but I'm pretty sure you should use data.table joins and not what you are using now. – Roland Aug 10 '22 at 12:59
  • 2
    drop the for-loop, start reading about joins https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right – Wimpel Aug 10 '22 at 13:09
  • @wimpel it is not a joining problem. Maybe the description is a bit unclear, but it's not simple joining what I need here – Rudolf Nyitray Aug 10 '22 at 13:25
  • 1
    then i do not understand your question right. Can you provide a *minimal* sample dataset, and the desired output for this data? – Wimpel Aug 10 '22 at 13:28
  • @wimpel, tried to add some more details. – Rudolf Nyitray Aug 10 '22 at 13:46

1 Answers1

1

If I understood correctly, this small dataset should reproduce the setup:

library(data.table)

journeys <- rbind(
  data.table(line = 1, origin = "A", destination = "C"),
  data.table(line = 2, origin = "B", destination = "E"),
  data.table(line = 2, origin = "D", destination = "E")
)

journeys
#>    line origin destination
#> 1:    1      A           C
#> 2:    2      B           E
#> 3:    2      D           E

stops <- rbind(
  data.table(line = 1, station = c("A", "B", "C")),
  data.table(line = 2, station = c("D", "B", "E"))
)

stops
#>    line station
#> 1:    1       A
#> 2:    1       B
#> 3:    1       C
#> 4:    2       D
#> 5:    2       B
#> 6:    2       E

And now the goal is to find all stops that constitute a journey.

Instead of looping, it may be faster to first use joins to find the origin and destination indices in the stops data frame:

setkey(journeys, line)
setkey(stops, line)

stops[, idx := .I]

journeys[stops, oidx := idx, on = .(line, origin = station)]
journeys[stops, didx := idx, on = .(line, destination = station)]
journeys[, len := 1L + didx - oidx]

journeys
#>    line origin destination oidx didx len
#> 1:    1      A           C    1    3   3
#> 2:    2      B           E    5    6   2
#> 3:    2      D           E    4    6   3

Then extract the corresponding stops:

journeys[, cbind(journey = rep(.I, len), stops[sequence(len, oidx)])]
#>    journey line station idx
#> 1:       1    1       A   1
#> 2:       1    1       B   2
#> 3:       1    1       C   3
#> 4:       2    2       B   5
#> 5:       2    2       E   6
#> 6:       3    2       D   4
#> 7:       3    2       B   5
#> 8:       3    2       E   6
Mikko Marttila
  • 10,972
  • 18
  • 31
  • perfect, many thanks! You just nailed it! Since there are some 'lines' that are not present in journeys, not all values could be extracted, but it is just a matter of data cleanness. Appreciate! from hours to 3 seconds – Rudolf Nyitray Aug 16 '22 at 08:26