3

I have a dataframe that looks like:

animal_id   trait_id    sire_id dam_id
    1         25.05        0       0
    2         -46.3        1       2
    3          41.6        1       2
    4         -42.76       3       4
    5         -10.99       3       4
    6         -49.81       5       4

I want to create another variable that contains the estimate of "trait_id" for each "sire_id" and "dam_id".

All sires (sire_id) and dams (dam_id) are also present in the animal_id column. So what I want to do is to look for their measurement in the trait_id and repeat this variable in the new variable.

The outcome that I want is:

animal_id   trait_id    sire_id trait_sire  dam_id  trait_dam
     1       25.05         0        NA        0        NA
     2       -46.3         1       25.05      2       -46.3
     3       41.6          1       25.05      2       -46.3
     4      -42.76         3       41.6       4       -42.76
     5      -10.99         3       41.6       4       -42.76
     6      -49.81         5      -10.99      4       -42.76

Any suggestion would be greatly appreciated.

Jaap
  • 81,064
  • 34
  • 182
  • 193
PaulaF
  • 393
  • 3
  • 17

3 Answers3

5

You can use match; match(col, df$animal_id) gives corresponding index of elements from col in the animal_id, which can be used further to locate the values of trait:

df[c("trait_sire", "trait_dam")] <- 
    lapply(df[c("sire_id", "dam_id")], function(col) df$trait_id[match(col, df$animal_id)])

df
#  animal_id trait_id sire_id dam_id trait_sire trait_dam
#1         1    25.05       0      0         NA        NA
#2         2   -46.30       1      2      25.05    -46.30
#3         3    41.60       1      2      25.05    -46.30
#4         4   -42.76       3      4      41.60    -42.76
#5         5   -10.99       3      4      41.60    -42.76
#6         6   -49.81       5      4     -10.99    -42.76
Psidom
  • 209,562
  • 33
  • 339
  • 356
5

With data.table joins...

library(data.table)
setDT(DT)    

DT[, trait_sire := 
  .SD[.SD, on=.(animal_id = sire_id), x.trait_id ]
]

DT[, trait_dam := 
  .SD[.SD, on=.(animal_id = dam_id), x.trait_id ]
]

   animal_id trait_id sire_id dam_id trait_sire trait_dam
1:         1    25.05       0      0         NA        NA
2:         2   -46.30       1      2      25.05    -46.30
3:         3    41.60       1      2      25.05    -46.30
4:         4   -42.76       3      4      41.60    -42.76
5:         5   -10.99       3      4      41.60    -42.76
6:         6   -49.81       5      4     -10.99    -42.76

The syntax is x[i, on=, j] where j is some function of the columns. To see how it works, try out DT[DT, on=.(animal_id = dam_id)] and variations. Some notes:

  1. The i.* / x.* syntax helps to distinguish where a column is taken from.
  2. When j is v := expression, the expression is assigned to column, v.
  3. The join x[i, ...] uses rows of i to look up rows of x.
  4. The on= syntax is like .(xcol = icol).
  5. Inside j, the table itself can be written as .SD.

One advantage of this approach over match is that it extends to joins on more than one column, like on = .(xcol = icol, xcol2 = icol2) or even "non equi joins" like on = .(xcol < icol). Also, it's part of a consistent syntax for operating on the table (explained in the package's introductory material), rather than specialized code for one task.

Frank
  • 66,179
  • 8
  • 96
  • 180
1

You can do this using match (in base R) in one run (no need to loop over)

df[c("trait_sire", "trait_dam")] <- 
cbind(with(df, trait_id[match(sire_id, animal_id)]), 
      with(df, trait_id[match(dam_id, animal_id)]))

  # animal_id trait_id sire_id dam_id trait_sire trait_dam
# 1         1    25.05       0      0         NA        NA
# 2         2   -46.30       1      2      25.05    -46.30
# 3         3    41.60       1      2      25.05    -46.30
# 4         4   -42.76       3      4      41.60    -42.76
# 5         5   -10.99       3      4      41.60    -42.76
# 6         6   -49.81       5      4     -10.99    -42.76
989
  • 12,579
  • 5
  • 31
  • 53
  • I think you can move the `with` outside of `cbind` so it's only written once. However, this kind of looks like a rehash of Psidom's answer, eh. – Frank Apr 11 '17 at 15:38
  • @Frank As for `with`, yes indeed, thanks for raising this point. Although might look like rehash, but mind the absence of `lapply`. I'd also expect to be faster. – 989 Apr 11 '17 at 15:41
  • 1
    Ok, I guess that's a fair perspective, but to me it seems like avoiding a loop by... writing out all the iterations of the loop. (Admittedly, my answer also writes out both separately, so I can't really complain.) – Frank Apr 11 '17 at 15:42