
I have many large (300k - 1M row) dataframes that I'm trying to append values to, by looping through the dataframes (df_i) and, for each one, looping through the rows and looking up the value in a second dataframe (do2) at matching latitude, longitude, month, and depth. Lat/lon/month will match exactly; the depth is trickier since do2 has 57 columns for values in increasing depth bins (a simplified head of do2 is given in the reproducible example below).

The core of my by-row loop code is a row subset on three AND-ed conditions plus a column lookup:

for (j in 1:nrow(df_i)) {
  # match the row on latitude, longitude and month, then pick the column of the
  # nearest depth bin (+ 3 skips the Month/Latitude/Longitude columns)
  df_i[j, "DO2"] <- do2[do2$Latitude == df_i[j, "latbin"] &
                        do2$Longitude == df_i[j, "lonbin"] &
                        do2$Month == month(df_i[j, "DateTimeUTCmin5"]),
                        which.min(abs(depthbins - df_i[j, "Depth.m."])) + 3]
}

This works, but it's slow. I know it could be sped up, but my parallelisation efforts keep hitting walls, and debugging/traceback is much harder in parallel. I tried FBM after reading this, but some 200k rows in I get

value must be unique or of the dimension of x[i, j]

I understand data.table indexes are fast, so possibly something like Frank's comment here might work - maybe a multi-row subset in data.table? But presumably that would be the same approach as my existing solution (since I also need to subset/look up columns), just a bit quicker?
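
For reference, here's roughly what I imagine that data.table version would look like (an untested sketch, assuming do2 had first been reshaped to a long table do2_long with one DO2 value per Month/Latitude/Longitude/depth-bin row, and that df_i already had a precomputed depbin column):

library(data.table)

do2_dt <- as.data.table(do2_long)   # long format: Month, Latitude, Longitude, depbin, DO2
df_dt  <- as.data.table(df_i)
setkey(do2_dt, Month, Latitude, Longitude, depbin)

# keyed update join: pulls the matching DO2 onto df_dt by reference
df_dt[do2_dt,
      DO2 := i.DO2,
      on = .(month = Month, latbin = Latitude, lonbin = Longitude, depbin = depbin)]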

Does anyone know of a smarter approach? I've previously been confounded by the apply functions, but I wouldn't be surprised if there's something useful there.

Thanks in advance.

Reproducible example (month simplified; depthbins added, as it was omitted before):

library(tibble)  # for tibble()

depthbins <- c(0,5,10,15,20,25,50,75,100,125,150,200,250,300,350,400)
df_i <- data.frame(latbin = c(-77.5, -78, -78.5),
                   lonbin = c(-178.5, -177.5, -176.5),
                   month = c(1,2,3),
                   Depth.m. = c(130,120,110))
do2 <- tibble(Month = c(1,1,1),
              Latitude = c(-78,-78,-79),
              Longitude = c(-178.5, -177.5, -177.5),
              "0" = c(214, 223, 345),
              "5" = c(123,234,345),
              "10" = c(345,456,567))

Final edit: some tweaks to Marius' code:

library(magrittr)   # for %<>%
library(tidyverse)  # gather(), left_join(), select()

do2 %<>% gather(.vars = colnames(do2)[4:length(colnames(do2))],
                key = "depbin", value = "DO2")
do2$depbin <- as.numeric(do2$depbin)
depthbins <- sort(unique(do2$depbin))
df_i$depbin <- sapply(df_i$Depth.m., function(d) depthbins[which.min(abs(depthbins - d))])

df_i %<>% left_join(do2, by = c("Month" = "Month",
                                "latbin" = "Latitude",
                                "lonbin" = "Longitude",
                                "depbin" = "depbin")) %>%
          select(-Month, -latbin, -lonbin, -depbin)
dez93_2000
  • Can you provide reproducible examples of `df_i`, `do2` and `depthbins`? They would only have to be a few rows each, with a few depth bins. I think this is definitely possible to speed up, using something like a merge (possibly a fuzzy join due to the matching of `Depth.m`), and without necessarily using any parallel processing. – Marius May 17 '19 at 01:54
  • Also, with your matching of `Depth.m`, are you just looking for the bin that the `Depth.m` value sits in, by comparing against the midpoint of the bin? So any value `5 <= Depth.m < 10` gets matched with the `5` bin? – Marius May 17 '19 at 01:56
  • Thanks Marius. Repro added. Depth.m matching: I've used the min distance from df_i depth to depthbin values to generate an index but thinking about that line now, it relates to the bin value which is the start of the bin not the midpoint, so probably I need to improve that! – dez93_2000 May 17 '19 at 02:14
  • Edit: the depthbins code does work. Presuming the values were taken at those depthbins depth points, the code aligns the df_i depth to the nearest depthbins depth, which is correct. Should have known it was right, since I didn't write it myself!! – dez93_2000 May 17 '19 at 02:25
  • More thoughts: potentially I could gather the multiple depth columns into 1 (https://r4ds.had.co.nz/tidy-data.html#gathering), changing do2 from 49128*60 to 2800296*4. This would make it 3.8X bigger but potentially facilitate standard column-only lookup/match/index/join operations. Additionally I could replace the depths with depthbin lookups beforehand to remove a calculation from the loop... – dez93_2000 May 17 '19 at 02:40

1 Answer


I think with a little reorganisation you can do this as a merge. The merge part should be much, much faster than your for loop approach, which will be offset slightly by the increased size of do2 and the prep time. Note I've had to modify your example data a bit so that each row actually has something to match against:

library(tidyverse)

depthbins <- c(0,5,10,15,20,25,50,75,100,125,150,200,250,300,350,400)
df_i <- data.frame(latbin = c(-77.5, -78, -78.5),
                   lonbin = c(-178.5, -177.5, -176.5),
                   month = c(1,2,3),
                   Depth.m. = c(130,120,110))
do2 <- tibble(Month = c(1,2,3),
              Latitude = c(-77.5,-78,-78.5),
              Longitude = c(-178.5, -177.5, -176.5),
              "100" = c(214, 223, 345),
              "125" = c(123,234,345),
              "150" = c(345,456,567))


# Precalculate closest bin for each row
df_i$bin = sapply(df_i$Depth.m., function(d) depthbins[which.min(abs(depthbins - d))])

# Convert do2 to long
do2_long = do2 %>%
    gather(bin, DO2, -Month, -Latitude, -Longitude) %>%
    mutate(bin = as.numeric(bin))

# Now everything can just be done as a merge
# The merge syntax would be a bit cleaner if you give the two df's
#   matching column names to start with
df_i %>%
    left_join(do2_long, by = c("month" = "Month", "latbin" = "Latitude", 
                               "lonbin" = "Longitude", "bin" = "bin"))
Marius
  • Thanks so much! The sapply is such an elegant solution, this is why I need to get my head around apply! Interested in your gather approach - I found this resulted in NAs for all do2_long$bin values. Also there seem to be twice as many rows as I'd expect. I'll paste my approach to this in the question. CHEERS!! – dez93_2000 May 17 '19 at 04:09
  • 1
    I was trying to `gather` everything except the `Month`, `Latitude` and `Longitude` columns, since that would mean I was gathering all the depth bin columns. You may have to tweak the `gather` command if you have other columns in `do2`. `as.numeric` may be making all the `bin` values `NA` if they aren't coming out as simple strings like `"5"`, `"10"`, etc. – Marius May 17 '19 at 04:12
  • Cheers Marius - sorted now with gather code I was working on at the same time, & question updated with tweaked code. You've saved an otherwise unproductive day - thanks again. – dez93_2000 May 17 '19 at 04:18
  • 1
    No worries. I'd be interested to know roughly how much this speeds things up, since examples like this are always good when you're trying to explain the disadvantages of for loops in R. – Marius May 17 '19 at 04:21
  • Qualitative response: MASSIVELY. And, resulted in a HUGE reduction in code lines for a much cleaner script. And, solving with *apply should allow further speedups using (e.g.) parSapply which is a (near) drop-in replacement, compared to (e.g.) foreach which requires various structural rearrangements (and which I can't figure out how to make work!). Very peculiarly, this task has caused the largest file to SHRINK from 410 to 380mb, having had 1 column of 1.6M rows added to it. Total size of all files is the same so maybe this is just compression optimisation. Still seems super weird! – dez93_2000 May 17 '19 at 06:12
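
A rough sketch of the parSapply swap mentioned in the last comment (assuming a local PSOCK cluster; depthbins has to be exported to the workers, and for small vectors the parallel overhead may well outweigh any gain):

library(parallel)

cl <- makeCluster(detectCores() - 1)   # local PSOCK cluster
clusterExport(cl, "depthbins")         # workers need depthbins in their environment

# same nearest-bin lookup as before, spread across the workers
df_i$depbin <- parSapply(cl, df_i$Depth.m., function(d) {
  depthbins[which.min(abs(depthbins - d))]
})

stopCluster(cl)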