2

I have three data frames dbase, data and dcode.

I want to copy the rows from dbase to data when dbase$idbase = data$iddata and add based on dcode the names of dcode when dbase$nr matches the number in dcode.

data <- 
  structure(list(iddata = 11103:11121), .Names = "iddata", class = "data.frame", row.names = c(NA, -19L))

dbase <-
 structure(list(idbase = 11102:11116, nr = c(115L, 116L, 117L, 
 118L, 37L, 37L, 37L, 41L, 41L, 51L, 51L, 59L, 74L, 77L, 85L)), .Names = c("idbase", 
 "nr"), class = "data.frame", row.names = c(NA, -15L))


dcode <- 
 structure(list(X32 = c(35L, 36L, 37L, 49L, 50L, 51L, 90L, 99L, 
 100L, 101L, 103L), X23 = c(26L, 27L, 28L, 29L, 30L, 31L, 38L, 
 39L, 40L, 41L, 42L), X9 = c(10L, 11L, 12L, 13L, 15L, 16L, 17L, 
 25L, 33L, 34L, 120L)), .Names = c("X32", "X23", "X9"), class = "data.frame", row.names = c(NA, -11L))

My aim is this table

iddata  idbase  nr  foc
11102   11102   115 0
11103   11103   116 0
11104   11104   117 0
11105   11105   118 0
11106   11106   37  X32
11107   11107   37  X32
11108   11108   37  X32
11109   11109   41  X23
11110   11110   41  X23
11111   11111   51  X32
11112   11112   51  X32
11113   11113   59  0
11114   11114   74  0
11115   11115   77  0
11116   11116   85  0
11117   0       0
11118   0       0
11119   0       0
11120   0       0
11121   0       0
Adam
  • 434
  • 1
  • 4
  • 18

2 Answers2

3

Could try some merge manipulations such as

temp <- merge(data, dbase, by.x = "iddata", by.y = "idbase", all = TRUE)
temp2 <- data.frame("nr" = unlist(dcode))
temp2$foc <- rep(names(dcode), each = nrow(dcode))
temp <- merge(temp, temp2, by = "nr", all.x = TRUE)
temp[order(temp$iddata), ]
#     nr iddata  foc
# 12 115  11102 <NA>
# 13 116  11103 <NA>
# 14 117  11104 <NA>
# 15 118  11105 <NA>
# 1   37  11106  X32
# 2   37  11107  X32
# 3   37  11108  X32
# 5   41  11109  X23
# 4   41  11110  X23
# 6   51  11111  X32
# 7   51  11112  X32
# 8   59  11113 <NA>
# 9   74  11114 <NA>
# 10  77  11115 <NA>
# 11  85  11116 <NA>
# 19  NA  11117 <NA>
# 16  NA  11118 <NA>
# 17  NA  11119 <NA>
# 18  NA  11120 <NA>
# 20  NA  11121 <NA>
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • Thank you @DavidArenburg for the solution, is there also an SQL kind of solution, just wondering! – Adam Nov 23 '14 at 15:01
  • For sure there is. Are you specifically looking for one? – David Arenburg Nov 23 '14 at 15:15
  • Just to learn, when I wanted to tackle the problem I thought of SQL (a quite similar problem, although not for R, brought me to this idea), I thought it would be faster (just a hunch). – Adam Nov 23 '14 at 15:54
  • There seems to be a down voting spirit here. Yes I agree, some argument would be nice. Just down voting is not that constructive, I think. I’m not imprested (and don’t give a…. about points) by down voting people who have a high K (a bunch of points), but no arguments. I’m impressed by people that contribute in a constructive way (the high and no K with arguments) to a solution by sharing their deepest thoughts and concerns about the question or the solutions. – Adam Nov 26 '14 at 09:53
  • I would guess it's Gabor, the developer of `sqldf` package. When I'll have some time maybe I'll add the sql solution, though I doubt it will be faster. The fastest solution would be using the `data.table` binary merge, but you have a special case of outer join here, which cannot be done (yet) with the binary merge. For some examples of this merge, see my answer from [here](http://stackoverflow.com/questions/27105686/merging-two-data-frames-with-different-sizes-by-matching-their-columns/27107891#27107891) – David Arenburg Nov 26 '14 at 09:58
  • I hoop that Gabor is willing to share his optimal sqldf solution, that would be nice. – Adam Nov 26 '14 at 10:10
2

Here's how I'd tackle it with dplyr and tidyr. First tidy dcode so that variables are in the columns, and use the same variable name in data and dbase:

library("dplyr")
library("tidyr")

data <- data %>% rename(id = iddata)
dbase <- dbase %>% rename(id = idbase)
dcode_tidy <- dcode %>% 
  gather(foc, nr)

Then you only need two simple joins to get the result you want:

data %>%
  full_join(dbase) %>%
  left_join(dcode_tidy)
#>     nr    id  foc
#> 1  116 11103 <NA>
#> 2  117 11104 <NA>
#> 3  118 11105 <NA>
#> 4   37 11106  X32
#> 5   37 11107  X32
#> 6   37 11108  X32
#> 7   41 11109  X23
#> 8   41 11110  X23
#> 9   51 11111  X32
#> 10  51 11112  X32
#> ...

(You'll need the dev version of dplyr to get the full join)

As with most data manipulation challenges, the answer is easy once you have the data in the right form.

hadley
  • 102,019
  • 32
  • 183
  • 245
  • @KFB updated to use full joins instead of left joins – hadley Nov 25 '14 at 13:06
  • `full_join` is new to the `dplyr` arsenal. Thanks – KFB Nov 25 '14 at 13:53
  • Thank you for your solution @hadley, It is a ‘clean’ solution. I will dive in tidyr.I read your paper, very interesting. One thing surprised me (I don’t think this is the place to share thoughts on that paper but…) was the sentence “This dataset has three variables, religion, income and frequency.” on page p.6. In research (which is not the same as in math, I think), as far I know (and I don’t know much), the frequency is not a variable. – Adam Nov 26 '14 at 09:59
  • For my clarity, measuring e.g. length of humans’ results in the variable length, the frequency is the occurrence of a particular length (the variable). So length (the variable) can only vary in the existents’ of a frequency (?). My question is, can there be, in research, a frequency without another variable and vice versa? Or is this mythological monster with a goat's body and a lion’s head? – Adam Nov 26 '14 at 10:02
  • @Adam that is a tricky philosophy point, centred on what you're "observing". If an observation is a group of people, then frequency is a variable that represents the number of people in that group. – hadley Nov 26 '14 at 14:11