1

I have this piece of code that I'd like to wrap in a function

indata <- data.frame(id = c(1L, 2L, 3L, 4L, 12L, 13L, 14L, 15L), 
                     fid = c(NA, 9L, 1L, 1L, 7L, 5L, 5L, 5L), 
                     mid = c(0L, NA, 2L, 2L, 6L, 6L, 6L, 8L))
library(data.table)
DT <- as.data.table(indata)

DT[, msib:=.(list(id)), by = mid][                                              
   ,msibs := mapply(setdiff, msib, id)][
   ,fsib  := .(list(id)), by = fid][
   ,fsibs := mapply(setdiff, fsib, id)][
   ,siblist  := mapply(union, msibs, fsibs)][
   ,c("msib","msibs", "fsib", "fsibs") := NULL] 

So far so good. Works as desired. Now it should be wrapped in a function, where I can pass alternative variable names (without quoting if possible), and here's my first try.

f <- function(DT, id, fid, mid) {

    DT[, msib:=.(list(id)), by = mid][                                              
       ,msibs := mapply(setdiff, msib, id)][
       ,fsib  := .(list(id)), by = fid][
       ,fsibs := mapply(setdiff, fsib, id)][
       ,siblist  := mapply(union, msibs, fsibs)][
       ,c("msib","msibs", "fsib", "fsibs") := NULL] 
}

I know this isn't working but lets look at the error it throws

indata2 <- indata
names(indata2) <- c("A", "B", "C")  # Give new names
DT2 <- as.data.table(indata2)
f(DT2, A, B, C)

Error in as.vector(x, "list") : cannot coerce type 'closure' to vector of type 'list'

That makes sense. Now to make sure that the promises are evaluated correctly I tried this

f <- function(DT, id, fid, mid) {
    mid <- deparse(substitute(mid))
    id <- deparse(substitute(id))
    fid <- deparse(substitute(fid))

    DT[, msib:=.(list(id)), by = mid][                                              
       ,msibs := mapply(setdiff, msib, id)][
       ,fsib  := .(list(id)), by = fid][
       ,fsibs := mapply(setdiff, fsib, id)][
       ,siblist  := mapply(union, msibs, fsibs)][
       ,c("msib","msibs", "fsib", "fsibs") := NULL] 
}

That doesn't throw an error but also does not work. The output looks like this

f(DT2, A, B, C)
    A  B  C siblist
1:  1 NA  0        
2:  2  9 NA        
3:  3  1  2        
4:  4  1  2        
5: 12  7  6        
6: 13  5  6        
7: 14  5  6        
8: 15  5  8   

and the siblist column is empty which it shouldn't be and isn't when I run it manually. I also tried this version (converting it to character strings) to see if that worked:

f <- function(DT, id, fid, mid){
    mid <- as.character(substitute(mid))
    id <- as.character(substitute(id))
    fid <- as.character(substitute(fid))
    DT[, msib:=.(list(id)), by = mid][ # Siblings through the mother
       ,msibs := mapply(setdiff, msib, id)][
       ,fsib  := .(list(id)), by = fid][
       ,fsibs := mapply(setdiff, fsib, id)][
       ,siblist  := mapply(union, msibs, fsibs)][
       ,c("msib","msibs", "fsib", "fsibs") := NULL] # Removed unused
}

but that doesn't work either - same output as above. I think it may be because the promises in the j part of the data.table are evaluated in the wrong environment but am not sure. How can I fix my function?

ekstroem
  • 5,957
  • 3
  • 22
  • 48
  • Side note re "so far so good", list columns are pretty inefficient. You might want to check out the igraph package or something else specialized for this sort of data (with links between individuals and whatnot). Assuming you don't mind inefficiency, a quick fix would be to make `DT = setnames(copy(DT), c(mid, id, fid), c("mid", "id", "fid"))` after your deparse/substitute/etc, right? – Frank Aug 30 '17 at 22:16
  • That would work but I need to work on a dataset with 20+ mio edges so I'd prefer not to create an extra copy of the data. I'd love to hear about other options for generating the final lists. I did some early dabbling into timings between different packages for this problem and found that `data.table` came out rather fast, but I'd love to get other ideas for approaches. – ekstroem Aug 30 '17 at 23:28

2 Answers2

2

If you expect an object to have a certain structure or hold certain data, then defining a class can really help. And with S3, it's simple.

as.relationship <- function(DT, id, fid, mid) {
  out <- DT[, c(id, fid, mid), with = FALSE]
  setnames(out, c("id", "fid", "mid"))
  setattr(out, "class", c("relationship", class(out)))
  out
}

Then you can write a function to work on that class with the safety of knowing where everything is.

f <- function(DT, id, fid, mid) {
  relatives <- as.relationship(DT, id, fid, mid)
  relatives[
    relatives,
    on = "fid",
    allow.cartesian = TRUE
  ][
    relatives,
    on = "mid",
    allow.cartesian = TRUE
  ][
    ,
    {
      siblings    <- union(i.id, i.id.1)
      except_self <- setdiff(siblings, .BY[["id"]])
      list(siblist = list(except_self))
    },
    by = "id"
  ]
}

This function takes the column names as strings. So you'd call it like this:

f(DT, "id", "fid", "mid")
#    id  siblist
# 1:  1         
# 2:  2         
# 3:  3        4
# 4:  4        3
# 5: 12    13,14
# 6: 13 14,15,12
# 7: 14 13,15,12
# 8: 15    13,14

setnames(DT, c("A", "B", "C"))
f(DT, "A", "B", "C")
#    id  siblist
# 1:  1         
# 2:  2         
# 3:  3        4
# 4:  4        3
# 5: 12    13,14
# 6: 13 14,15,12
# 7: 14 13,15,12
# 8: 15    13,14

If you're worried about performance, don't be. If you create a data.table from entire columns of another data.table, they're smart enough not to actually copy the data. They share it. So there's no real performance penalty to making another object.

Nathan Werth
  • 5,093
  • 18
  • 25
  • Sorry for the late reply. If I run `f(DT, id, fid, mid)` then I get the following `Error in eval(jsub, parent.frame(), parent.frame()) : object 'fid' not found` – ekstroem Sep 06 '17 at 17:38
  • I should've mentioned the function takes column names as strings. I've updated the answer. – Nathan Werth Sep 06 '17 at 17:44
  • Thanks! Works perfectly (although I'm still surprised by the number of hoops that are necessary to jump) – ekstroem Sep 06 '17 at 19:42
  • The hoops I put out aren't *required* so much as helpful. Using the S3 system for loose classes helps organize your code. The `as.relationship` function is a glorified renamer, and the class attribute is like a tag saying "all's well". – Nathan Werth Sep 06 '17 at 20:25
0

This is getting ugly but it seems to work. With lots of get()s:

f <- function(DT, id, fid, mid) {
  mid <- deparse(substitute(mid))
  id <- deparse(substitute(id))
  fid <- deparse(substitute(fid))

  DT[, msib:=.(list(get(id))), by = get(mid)][                                              
    ,msibs := mapply(setdiff, msib, get(id))][
      ,fsib  := .(list(get(id))), by = get(fid)][
        ,fsibs := mapply(setdiff, fsib, get(id))][
          ,siblist  := mapply(union, msibs, fsibs)][
            ,c("msib","msibs", "fsib", "fsibs") := NULL] 
}

DT2 <- as.data.table(indata2)
f(DT2, A, B, C)

all.equal(DT, DT2)
# [1] "Different column names"
Aurèle
  • 12,545
  • 1
  • 31
  • 49
  • That is an awful lot of `get`'s :o) It works as you write but if I have a `data.table` with column names `id`, `fid`, and `mid` (just as the arguments) then I get the error *Error in get(mid) : invalid first argument*. Guessing it is the first line giving hick-ups? – ekstroem Sep 01 '17 at 00:26