1

I am using the data.table package to work with a very large data set, and value it's speed and clarity. But I am new to it and am having difficulties chaining functions together especially when working with a mixed set of data.table and base R functions. My question is, how do I chain together the below example functions, into one seamless string of code for defining the target data object?

Below is the correct output, generated by running each line of code separately (unchained) with the generating code shown immediately beneath the output:

> data
    ID Period State Values
 1:  1      1    X0      5
 2:  1      2    X1      0
 3:  1      3    X2      0
 4:  1      4    X1      0
 5:  2      1    X0      1
 6:  2      2    XX      0
 7:  2      3    XX      0
 8:  2      4    XX      0
 9:  3      1    X2      0
10:  3      2    X1      0
11:  3      3    X9      0
12:  3      4    X3      0
13:  4      1    X2      1
14:  4      2    X1      2
15:  4      3    X9      3
16:  4      4    XX      0

library(data.table)

data <- 
  data.frame(
    ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
    Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
    Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
    Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
    State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
  )

# changes State to "XX" if remaining Values_1 + Values_2 cumulative sums = 0 for each ID: 
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID]

# create new column "Values", which equals "Values_1":
setDT(data)[,Values := Values_1] 

# in base R, drops columns Values_1 and Values_2:
data <- subset(data, select = -c(Values_1,Values_2)) # How to do this step in data.table, if possible or advisable?  

# in base R, changes all "XX" elements in State column to "HI":
data$State <- gsub('XX','HI', data$State) # How to do this step in data.table, if possible or advisable?  

For what it's worth, below is my attempt to chain together using '%>%' pipe operators, which fails (error message Error in data$State : object of type 'closure' is not subsettable), and though I'd rather chain together using data.table operators:

data <- 
  data.frame(
    ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
    Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
    Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
    Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
    State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
  ) %>%
  setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID] %>%
  setDT(data)[,Values := Values_1] %>%
  subset(data, select = -c(Values_1,Values_2)) %>%
  data$State <- gsub('XX','HI', data$State)

3 Answers3

4

If I understand correctly, the OP wants to

  • rename column Value_1 to Value (or in OP's words: create new column "Values", which equals "Values_1")
  • drop column Value_2
  • replace all occurrences of XX by HI in column State

Here is what I would do in data.table syntax:

setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID][
  , Values_2 := NULL][
    State == "XX", State := "HI"][]
setnames(data, "Values_1", "Values")
data
       ID Period Values  State
 1:     1      1      5     X0
 2:     1      2      0     X1
 3:     1      3      0     X2
 4:     1      4      0     X1
 5:     2      1      1     X0
 6:     2      2      0     HI
 7:     2      3      0     HI
 8:     2      4      0     HI
 9:     3      1      0     X2
10:     3      2      0     X1
11:     3      3      0     X9
12:     3      4      0     X3
13:     4      1      1     X2
14:     4      2      2     X1
15:     4      3      3     X9
16:     4      4      0     HI

setnames() updates by reference, e.g., without copying. There is no need to create a copy of Values_1 and delete Values_1 later on.

Also, [State == "XX", State := "HI"] replaces XX by HI only in affected rows by reference while
[, State := gsub('XX','HI', State)] replaces the whole column.

data.table chaining is used where appropriate.

BTW: I wonder why the replacement of XX by HI cannot be done rightaway in the first statement:

setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "HI"), ID][
  , Values_2 := NULL][]
setnames(data, "Values_1", "Values")
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • Given the size of data files I'm working with, I'm going to spend time learning more about data.table and will need to convert a lot of my code from base R/dplyr to data.table. I've been running the above solutions against the full data set, and speed checking, and the speed boost offered by data.table is significant – Curious Jorge - user9788072 May 16 '22 at 06:22
  • 1
    Some years ago, before the advent of the *tidyverse* was heralded, I decided to start with `data.table`. I prefer the concise, SQL-like syntax, the outstanding join features, *updates by reference*, and many handy functions like `rowid()`, `rleid()`, `fread()`, e.g. Speed and low memory footprint come as bonus. – Uwe May 16 '22 at 08:26
  • 1
    However, if speed is a concern to you, @CuriousJorge-user9788072, you may have a look at the [`collapse` package](https://sebkrantz.github.io/collapse/) which seems to be even faster than `data.table`. I have no experience with it, yet, but if have seen some answers on SO where `collapse` has a speed advantage. – Uwe May 16 '22 at 08:35
2

You can just chain using bracket notation [. That way you only need to call setDT() once, as you are continuing all operations in the data.table universe, so data does not stop being a data.table. Also setDT() modifies in place, so it does not need assignment (although by piping to it its return value is being assigned to data which is fine, too).

First define the data and make it a data.table:


library(data.table)
data <-
    data.frame(
        ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
        Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
        Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
        Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
        State = c("X0", "X1", "X2", "X1", "X0", "X2", "X0", "X0", "X2", "X1", "X9", "X3", "X2", "X1", "X9", "X3")
    ) |>
    setDT()

Then define the columns you need. Note the functional notation to apply a function on several columns.

data[, `:=`(
    State = ifelse(
        rev(cumsum(rev(Values_1 + Values_2))),
        State, "XX"
    )
),
by = ID
][
    ,
    `:=`(
        Values = Values_1,
        Values_1 = NULL,
        Values_2 = NULL,
        State = gsub("XX", "HI", State)
    )
]

Output:

data
#     ID Period State Values
#  1:  1      1    X0      5
#  2:  1      2    X1      0
#  3:  1      3    X2      0
#  4:  1      4    X1      0
#  5:  2      1    X0      1
#  6:  2      2    HI      0
#  7:  2      3    HI      0
#  8:  2      4    HI      0
#  9:  3      1    X2      0
# 10:  3      2    X1      0
# 11:  3      3    X9      0
# 12:  3      4    X3      0
# 13:  4      1    X2      1
# 14:  4      2    X1      2
# 15:  4      3    X9      3
# 16:  4      4    HI      0

You may want to read further about chaining commands in data.table. I think that page is an excellent summary of the syntax and features of the package and is worth reading in its entirety.

SamR
  • 8,826
  • 3
  • 11
  • 33
1

You can use the magrittr package to chaining data.tables using . before [. Try the following code:

library(dplyr)
library(magrittr)
library(data.table)
data <- 
  data.frame(
    ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
    Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
    Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
    Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
    State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
  ) %>% 
  setDT(data) %>%
  .[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID] %>%
  .[,Values := Values_1] %>%
  select(-c(Values_1, Values_2)) %>%
  mutate(State = gsub('XX','HI', State))

Output:

   rn ID Period State Values
 1:  1  1      1    X0      5
 2:  2  1      2    X1      0
 3:  3  1      3    X2      0
 4:  4  1      4    X1      0
 5:  5  2      1    X0      1
 6:  6  2      2    HI      0
 7:  7  2      3    HI      0
 8:  8  2      4    HI      0
 9:  9  3      1    X2      0
10: 10  3      2    X1      0
11: 11  3      3    X9      0
12: 12  3      4    X3      0
13: 13  4      1    X2      1
14: 14  4      2    X1      2
15: 15  4      3    X9      3
16: 16  4      4    HI      0
Quinten
  • 35,235
  • 5
  • 20
  • 53
  • 1
    Do you need to add a `library(dplyr)` here for your `select()` and `mutate()` calls? Also [in general](https://stackoverflow.com/questions/27511604/dplyr-on-data-table-am-i-really-using-data-table) I think using `dplyr` verbs on `data.table` objects is less memory efficient than using the `data.table` equivalents, although in this particular case the difference will be negligible and perhaps your code answers the specific question about chaining with the pipe slightly more. – SamR May 15 '22 at 09:10
  • 1
    @SamR, you are right that `dplyr` is necessary. I think the main point of this question is to use the `data.table` functions in one chain using `%>%`. – Quinten May 15 '22 at 09:25