
I am trying to parallelize a nested loop that copies values from one dataset into another. For each variable the two datasets have in common (changevars) and each country (v5), it replaces every observation, matched by its id (v3). I have to match on country + id because ids are duplicated across countries.

My loop code is:

for (var in changevars) {
  print(var)
  for (i in unique(int2006$v5)) {
    print(i)
    for (id in unique(int2006$v3)) {
      x2006r[x2006r$v5 == i & x2006r$v3 == id, var] <- int2006[int2006$v5 == i & int2006$v3 == id, var]
    }
  }
}

I want to parallelize it.

Although it works, it is really slow, and I do not understand the logic of converting a for loop into a foreach loop with %dopar%. I've tried to follow the other answers, but all my attempts have failed.

Reproducible example of datasets:

  1. Source Dataset
> dput(int2006)
structure(list(v3 = c(10001, 10002, 10003, 10004, 10005, 10006, 
10007, 10008, 10009, 10010, 10011, 10012, 10013, 10014, 10015, 
10016, 10017, 10018, 10019, 10020), v5 = c(36, 36, 36, 36, 36, 
36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36), 
    v7 = c(3606, 3606, 3606, 3606, 3606, 3606, 3606, 3606, 3606, 
    3606, 3606, 3606, 3606, 3606, 3606, 3606, 3606, 3606, 3606, 
    3606), v8 = c(1, 1, 2, 1, NA, NA, 1, 2, 2, 2, NA, 2, 2, 1, 
    1, 1, 2, 2, 1, 2), v9 = c(NA, 2, 1, 2, 1, 1, 1, 2, 4, 1, 
    NA, 1, NA, 1, 1, 1, 1, 1, 1, 2)), row.names = c(NA, 20L), class = "data.frame")
  2. Target Dataset (the one into which the cells from 1 should be copied):
> dput(x2006r)
structure(list(v3 = c(10001, 10002, 10003, 10004, 10005, 10006, 
10007, 10008, 10009, 10010, 10011, 10012, 10013, 10014, 10015, 
10016, 10017, 10018, 10019, 10020), v5 = c(36, 36, 36, 36, 36, 
36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36), 
    v7 = c("3606", "3606", "3606", "3606", "3606", "3606", "3606", 
    "3606", "3606", "3606", "3606", "3606", "3606", "3606", "3606", 
    "3606", "3606", "3606", "3606", "3606"), v8 = c(1, 1, 2, 
    1, NA, NA, 1, 2, 2, 2, NA, 2, 2, 1, 1, 1, 2, 2, 1, 2), v9 = c(NA, 
    2, 1, 2, 1, 1, 1, 2, 4, 1, NA, 1, NA, 1, 1, 1, 1, 1, 1, 2
    )), row.names = c(NA, 20L), class = "data.frame")
  3. Variables to iterate over
changevars <- c("v7","v8","v9")

Can someone help me? I'm really stuck. Also, I am not sure if parallelizing this loop will help me in terms of speed.

Thank you very much!

  • Hi, could you provide a minimal reproducible example of your dataset format? See here https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610 – Gowachin Apr 06 '21 at 14:46
  • Sure. Added. Thank you! – timeseriescry Apr 06 '21 at 15:14
  • Thanks, but what is changevars? Also, in future please include smaller, easier-to-read examples. A dataset that is too big is harder to look at... – Gowachin Apr 06 '21 at 15:34
  • What you added isn't a minimal example. It'll be much easier for us to help you debug if we can figure out what you're trying to accomplish more easily. Try to keep the data in the `reprex` as small as possible (like, only as many rows and columns as you absolutely need) – Matt Kaye Apr 06 '21 at 15:40
  • Also we can't reproduce what you want since `"cumulation"` is not in one dataset (`int2006a`). – Gowachin Apr 06 '21 at 15:44
  • With relatively large input, meaningless variable and column names, and no desired output, I'm having trouble understanding exactly what your goal is, but seems like it might be an update join? Or maybe just a regular join? Maybe `merge(x2006ra, int2006a[c("v3", "v5", changevars)], all.x = TRUE)` ? If the `changevars` are already present in the target, it would be useful to know if you expect to replace all values of them, or if there are some values without matches that need to be retained. – Gregor Thomas Apr 06 '21 at 16:03
  • OK. I fixed the reproducible example so that it has only the variables needed for the loop and the three variables that are supposed to be replaced cell by cell from one dataset to the other. Thank you for your time, and sorry for the bother. The loop works; what I want is to parallelize it, but I do not understand the logic behind it. – timeseriescry Apr 06 '21 at 16:08
  • Gregor: I want to replace the cells from one dataset to the other, and the loop should only replace those cells with matches. With this for loop form it works as I want, but it is extremely slow, that is why I am trying to parallelize it, but I am unable for the moment. – timeseriescry Apr 06 '21 at 16:13

1 Answer

This is a common operation called an "update join". A relatively new dplyr utility function, rows_update() (added in dplyr 1.0.0), makes it very easy:

library(dplyr)
join_vars <- c("v3", "v5")
changevars <- c("v7","v8","v9")
result <- rows_update(x = x2006r, y = int2006[c(join_vars, changevars)], by = join_vars)
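
To see what the call does, here is a minimal sketch on toy data. The data frames `x` and `y` below are made up for illustration and are not the question's datasets; the sketch assumes every key in `y` also occurs in `x` and that column types agree. In the question's data, v7 is character in x2006r but numeric in int2006, so depending on the dplyr version it may need converting first.

```r
library(dplyr)

# Hypothetical toy data: the id (v3) repeats across countries (v5).
x <- data.frame(v3 = c(1, 2, 1), v5 = c(36, 36, 40), v8 = c(NA, 1, 2))
y <- data.frame(v3 = c(1, 1),    v5 = c(36, 40),     v8 = c(9, 8))

# Rows of x whose (v3, v5) key appears in y take y's values;
# the row with key (2, 36) has no match in y and is left unchanged.
updated <- rows_update(x, y, by = c("v3", "v5"))
updated
```

If some keys in the source have no counterpart in the target, rows_update() raises an error by default; newer dplyr releases document an `unmatched` argument for that case, so it is worth checking the help page for the installed version.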

If you do want to roll your own, at least start with a join. You can see a few dplyr-based implementations here. I believe data.table also does this very well.
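
As a rough idea of what "starting with a join" could look like without any packages, here is a base R sketch built on match(). The helper name `update_join` and the toy data frames `tgt` and `src` are hypothetical, and the sketch assumes each key combination appears at most once in the source (with duplicates, match() would silently use the first one).

```r
# Hypothetical helper: copy `vars` from src into tgt for rows whose
# key columns match. Rows of tgt without a match are left untouched.
update_join <- function(tgt, src, keys, vars) {
  # One composite key string per row, built from the key columns.
  tgt_key <- do.call(paste, c(tgt[keys], sep = "\r"))
  src_key <- do.call(paste, c(src[keys], sep = "\r"))
  # For each target row, the index of the matching source row (NA if none).
  idx <- match(tgt_key, src_key)
  hit <- !is.na(idx)
  for (var in vars) {
    tgt[hit, var] <- src[idx[hit], var]
  }
  tgt
}

# Toy data (made up): ids repeat across countries.
tgt <- data.frame(v3 = c(1, 2, 1, 2), v5 = c(36, 36, 40, 40), v8 = c(NA, 1, 2, NA))
src <- data.frame(v3 = c(1, 2, 2),    v5 = c(36, 40, 99),     v8 = c(5, 6, 7))

result <- update_join(tgt, src, keys = c("v3", "v5"), vars = "v8")
result$v8  # 5 1 2 6
```

The only loop left runs over the variables, not over every country and id, and all row matching happens in a single vectorised match() call, which is why this kind of approach tends to be much faster than the nested loop without any parallelization.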

Gregor Thomas
  • Since my reputation is not enough, it does not appear. This is a correct answer and it is way faster than my nested loop. Thank you very much, you really made my day. Thank you all! You have saved me MANY hours. :) – timeseriescry Apr 06 '21 at 16:46