
In a user-defined function I want to do some data.table transformations; in particular, I want to create a new column with the ':=' operator.

Assume I want to create a new column called Sex that capitalizes the first letter of the column df$sex in my example data.frame df.

The output of my prepare function should be a data.table with the same name as before but with the additional "capitalised" column.

I tried several ways to write this function. However, I always get the following warning (and no correct output):

Warning message: In [.data.table(x, , :=(Sex, stringr::str_to_title(sex))) : Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.

library(data.table)
library(magrittr)
library(stringr)


df <- data.frame("age" = c(17, 04), 
                      sex = c("m", "f"))
df %>%   setDT()
is.data.table(df)

This is the easiest way to write my function:

prepare1<-function(x){
x[,Sex:=stringr::str_to_title(sex)]
}
prepare1(df)
#--> WARNING. (as block quoted above)


prepare2<-function(x){
  x[, `:=`(Sex, stringr::str_to_title(sex))]
}
prepare2(df)
#--> WARNING. (as block quoted above)


prepare3<-function(x){
  require(data.table)
  y <-as.data.table(list(x))
  y <- y[,Sex:=stringr::str_to_title(sex)]
  x <<- y
}
prepare3(df)

The last version does NOT throw the warning, but it creates a new global object called x. I wanted to overwrite the dataset I passed to the function (if I have to go that way at all).
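One way to get the "overwrite" behaviour without `<<-` is to return the result and let the caller reassign it. The following is only a minimal sketch of that idea: the function name prepare_copy is hypothetical, and stringsAsFactors = FALSE is added so the example behaves the same on older R versions.

```r
library(data.table)
library(stringr)

# Hypothetical helper: copy the input into a fresh data.table,
# add the column there, and return the result to the caller.
prepare_copy <- function(x) {
  y <- as.data.table(x)                     # a new data.table, independent of x
  y[, Sex := stringr::str_to_title(sex)]    # := on the local table
  y[]                                       # return it (and make it print normally)
}

df <- data.frame(age = c(17, 4), sex = c("m", "f"), stringsAsFactors = FALSE)
df <- prepare_copy(df)                      # the caller overwrites df explicitly
```

This costs one copy of the data, which may or may not matter depending on the size of the tables.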

From the := help file I also know I can use set, but I am not able to adapt the command appropriately. If that could solve my problem, I am happy to receive help on that, too! set(x, i = NULL, Sex, str_to_title(sex)) is apparently wrong ...
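For reference, set() takes the column as a name (or number) in j and the new values in value; it does not evaluate column names inside the table the way [.data.table does, so the source column has to be extracted explicitly. A minimal sketch, assuming df was converted with setDT(df) directly (the function name prepare_set is hypothetical):

```r
library(data.table)
library(stringr)

df <- data.frame(age = c(17, 4), sex = c("m", "f"), stringsAsFactors = FALSE)
setDT(df)   # convert by reference, using the name df directly

# Hypothetical helper: add the column with set() instead of :=
prepare_set <- function(x) {
  set(x, j = "Sex", value = stringr::str_to_title(x[["sex"]]))
  invisible(x)
}

prepare_set(df)
```

Note that set() relies on the same by-reference mechanism as :=, so this sketch does not by itself address a table that was converted through a function call rather than by name.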

On request, and to make the discussion in the comments clearer, here is how my code produces the problem:

library(data.table)
library(stringr)


df <- data.frame("age" = c(17, 04), 
                      sex = c("m", "f"))

GetLastAssigned <- function(match = "<- *data.frame",
                            remove = " *<-.*") {
  f <- tempfile()
  savehistory(f)
  history <- readLines(f)
  unlink(f)
  match <- grep(match, history, value = TRUE)
  get(sub(remove, "", match[length(match)]))
}

#ok, no need for magrittr
setDT(GetLastAssigned())

#check the last function worked
is.data.table(df)

prepare1<-function(x){
x[,Sex:=stringr::str_to_title(sex)]
}

prepare1(GetLastAssigned())
# I get a warning and it does not work.
prepare1(df)
# I get a warning and it does not work, either.


# If I manually type setDT(df) everything works fine, but I cannot type the "right" df at every place where I need to do this transformation.
canIchangethis
  • The culprit appears to be `magrittr`. If you just do `setDT(df)` this works as intended. – Roland Jul 24 '19 at 13:23
  • If you look at the source of `\`%>%\`` you see quite a few functions that are good candidates for this kind of issues. – Roland Jul 24 '19 at 13:25
  • Thank you, but I need magrittr because in my real application df is not set directly but returned by another function, i.e. "myotherfunction" returns df. So it needs to be myotherfunction() %>% setDT() or setDT(myotherfunction()). – canIchangethis Jul 24 '19 at 13:26
  • Well, you can try opening an issue on the data.table bug tracker. Personally, I see absolutely no need for using magrittr (and thus don't use it). – Roland Jul 24 '19 at 13:29
  • Dear Roland I usually also don't use the Magrittr but as setDT(Myotherfunction()) does not work to set the data.frame dt (recalled through "myotherfunction")) as a data.table I need a work-around. :( How can I see the source of `%>%` ? – canIchangethis Jul 24 '19 at 13:32
  • Just do `res <- Myotherfunction(); setDT(res)`. I really don't see the problem. You can see the source code by copying the code in my previous comment (including the backticks) into R. – Roland Jul 24 '19 at 13:34
  • But have an upvote. This is a very well written question with a nice reproducible example. – Roland Jul 24 '19 at 13:35
  • Dear Roland I did not know about the usage of the additional backpacks as well as the motherfunction, as I am very new to R. Thank you so much for your detailed advice (and the upvote, too). "Dankeschön!/Thanks a lot!" – canIchangethis Jul 24 '19 at 13:37
  • I tried without magrittr but it still is not the solution. I still get these errors. – canIchangethis Jul 24 '19 at 13:48
  • @Roland Not sure if it's the same issue, but I have run into related problems https://github.com/Rdatatable/data.table/issues/1628 which links to https://stackoverflow.com/a/26072152 where Arun in 2014 closed with "The idea so far is to use `setDT` to convert to data.tables before providing it to a function. But I'd like that these cases be resolved" – Frank Jul 24 '19 at 14:28
  • You might want to select a better title (click "edit" below the question to start), since the question has no mention of errors or "list()" – Frank Jul 24 '19 at 14:30
  • @Frank I changed the title. Thank you. However I am not really capable of understanding if it is really the same (for 5 years) unresolved problem. :( :/ – canIchangethis Jul 24 '19 at 14:39
  • @canIchangethis Yes, I am not sure if it is either. I guess the advice Arun gave then "to convert to data.tables before providing it to a function" is still the best way to go, if possible, though. – Frank Jul 24 '19 at 14:41
  • Well I can provide some more details, how I currently work. Will put it in the question. – canIchangethis Jul 24 '19 at 14:58

1 Answer


A workaround along the OP's lines:

library(data.table)
library(stringr)

GetLastAssigned2 <- function(match = "<- *data.frame", remove = " *<-.*") {
  f <- tempfile()
  savehistory(f)
  history <- readLines(f)
  unlink(f)
  match <- grep(match, history, value = TRUE)
  nm <- sub(remove, "", match[length(match)])
  list(nm = as.name(nm), addr = address(get(nm)))
}

prepit <- function(x){
  x[,Sex:=stringr::str_to_title(sex)]
}

# usage
df <- data.frame("age" = c(17, 04), sex = c("m", "f"))
z <- GetLastAssigned2()
eval(substitute(setDT(x), list(x=z$nm)))

str(df) # it seemingly works, since there is a selfref

# usage 2
df <- data.frame("age" = c(17, 04), sex = c("m", "f"))
setDT(df)
prepit(df)
str(df) # works

# usage 3
df <- data.frame("age" = c(17, 04), sex = c("m", "f"))
z <- GetLastAssigned2()
eval(substitute(setDT(x), list(x=z$nm)))
eval(substitute(prepit(x), list(x=z$nm)))
str(df) # works

Some big caveats:

  • savehistory is only effective in interactive use, based on my reading of the docs
  • using regex on human input (code typed in interactively) is complicated and risky
  • even this workaround will fail if the data.table x passed to prepit does not have enough "pre-allocated" space for extra columns

The data.table interface is based on passing the name/symbol of the data.frame or data.table, rather than the value (which is what get provides), as explained by Arun, one of the data.table authors. Note that the address cannot be passed around either: z$addr soon fails to match address(df) in all examples above.
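A rough way to check whether a table is in the state that := expects is to compare its allocated length with its number of columns. This is only a sketch, assuming the default over-allocation settings; truelength() is exported by data.table, and prepit is repeated here so the block stands alone.

```r
library(data.table)
library(stringr)

prepit <- function(x) {
  x[, Sex := stringr::str_to_title(sex)]
}

df <- data.frame(age = c(17, 4), sex = c("m", "f"), stringsAsFactors = FALSE)
setDT(df)   # called with the name df, so the converted table is bound to df

# After setDT(df), the column list should have spare slots for new columns
has_spare_slots <- truelength(df) > length(df)

prepit(df)  # adds Sex by reference to the table bound to df
```

If has_spare_slots is FALSE for a table, := has to fall back to taking a shallow copy, which is the situation the warning in the question describes.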


If I manually type setDT(df) everything works fine but I cannot type the "right" dfs at all the places where I need to do this transformation.

One idea:

# helper to compose expressions
subit = function(cmd, df_nm) 
  do.call("substitute", list(cmd, list(x=as.name(df_nm))))

# list of expressions with x where the df name belongs
my_cmds = list(
  setDT  = quote(setDT(x)),
  prepit = quote(x[,Sex:=stringr::str_to_title(sex)])
)

# usage 4
df = data.frame("age" = c(17, 04), sex = c("m", "f"))
df_nm = "df" # somehow get this... hopefully not via regex of command history
eval(subit(my_cmds$setDT, df_nm))
eval(subit(my_cmds$prepit, df_nm))

# usage 5
df = data.frame("age" = c(17, 04), sex = c("m", "f"))
df_nm = "df" 
for(ex in lapply(my_cmds, subit, df_nm = df_nm)) eval(ex)

I think this is more aligned with recommended programmatic usage of data.table.

There is probably some way to wrap this in a function by altering the envir= argument to eval() but I'm not knowledgeable about that.
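One possible way to do that wrapping is sketched below. It builds the two calls with bquote() and evaluates them in the caller's environment, so that setDT and := see the real name of the table. The function name prep_by_name and its envir argument are hypothetical, and the sketch has only been thought through for tables that live in the calling environment.

```r
library(data.table)
library(stringr)

# Hypothetical wrapper: takes the NAME of the table as a string and runs
# setDT() and the := step in the environment where that name lives.
prep_by_name <- function(df_nm, envir = parent.frame()) {
  x <- as.name(df_nm)
  eval(bquote(setDT(.(x))), envir)                                  # e.g. setDT(df)
  eval(bquote(.(x)[, Sex := stringr::str_to_title(sex)]), envir)    # e.g. df[, Sex := ...]
  invisible(NULL)
}

df <- data.frame(age = c(17, 4), sex = c("m", "f"), stringsAsFactors = FALSE)
prep_by_name("df")
```

With several tables, the names could be kept in a character vector and looped over, for example for (nm in c("df1", "df2")) prep_by_name(nm), where df1 and df2 stand in for whatever the tables are actually called.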

Regarding how to get the name of the assignment target in nm <- data.frame(...), it looks like there are no good options. Maybe see How do I access the name of the variable assigned to the result of a function within the function? or Get name of x when defining `(<-` operator

Frank
  • If your R code doesn't know the name of an object, you simply have designed it badly. I can't think of a valid reason to need arbitrary object names. – Roland Jul 25 '19 at 04:03
  • Dear Frank, thanks a lot, I will check this out. Dear Roland, well if I perform the prepare(x) command on 28 datasets (x) I do not always want to put the names of the datasets. this is why I think I need arbitrary object names. I really wonder why this is so a "non-issue" for you? – canIchangethis Jul 25 '19 at 08:24
  • Frank, well, the whole "df_nm = "df" # somehow get this... hopefully not via regex of command history" part after the # is confusing, as I never came across any other way to get this stuff. But besides that, interesting approach. I still need more time to completely understand the code. I am very new to R. – canIchangethis Jul 25 '19 at 08:32