Remove rows conditionally from a data.table in R

Question

I have a data.table with fields {id, menuitem, amount}.

This is transaction data - so, ids are unique, but menuitem repeats. Now, I want to remove all entries where menuitem == 'coffee'.

Also, want to delete all rows where amount <= 0;

What is the right way to do this in data.table?

I can use data$menuitem!='coffee' and then index int into data[] - but that is not necessarily efficient and does not take advantage of data.table.

Any pointers in the right direction are appreciated.

score 70 · Accepted Answer · edited May 23 '17 at 12:18

70

In this scenario it is not so different than data.frame

data <- data[ menuitem != 'coffee' | amount > 0]

Delete/add row by reference it is to be implemented. You find more info in this question

Regarding speed:

1 You can benefit from keys by doing something like:

setkey(data, menuitem)
data <- data[!"coffee"]

which will be faster than data <- data[ menuitem != 'coffee']. However to apply the same filters you asked in the question you'll need a rolling join (I've finished my lunch break I can add something later :-)).

2 Even without key data.table is much faster for relatively big table (similar speed for handful amount of rows)

dt<-data.table(id=sample(letters,1000000,T),var=rnorm(1000000))
df<-data.frame(id=sample(letters,1000000,T),var=rnorm(1000000))
library(microbenchmark)
> microbenchmark(dt[ id == "a"], df[ df$id == "a",])
Unit: milliseconds
               expr       min        lq    median        uq       max neval
      dt[id == "a"]  24.42193  25.74296  26.00996  26.35778  27.36355   100
 df[df$id == "a", ] 138.17500 146.46729 147.38646 149.06766 154.10051   100

edited May 23 '17 at 12:18

Community

1
1

answered Mar 26 '14 at 08:23

Michele

8,563
6
45
72

Thanks. Just out of curiosity, in this case, does it take advantage of the keys to perform the comparisons `menuitem != 'coffee' | amount > 0` any faster than the typical data.frame? – Gopalakrishna Palem Mar 26 '14 at 13:48
@GopalakrishnaPalem Please see EDIT in the answer – Michele Mar 26 '14 at 14:10
Thank you much Michele. Also, the microbenchmark command is useful, thanks for posting it. – Gopalakrishna Palem Mar 27 '14 at 00:23
How would you use the ":=" or "<- NULL"? to remove the rows containing something by reference . – skan Jan 08 '17 at 21:26
1

OK, I've found it's not yet implemented https://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-data-table – skan Jan 09 '17 at 23:58
I don't think this comparison is fair. Subsetting != removing rows. For instance: `microbenchmark(dt <- dt[ id == "a"], df <- df[ df$id == "a",])` show only x2 performance difference. – Ufos May 18 '18 at 09:35
However OP wants to **remove** entries. Here we should use `!= "a"` for the example. And oh my, data.tables is 2 times slower here. `microbenchmark(dt <- dt[ id != "a"], df <- df[ df$id != "a",])` -- check it out. I am gonna go open an issue on their github repo. – Ufos May 18 '18 at 09:41
1

^ issue + explanation: https://github.com/Rdatatable/data.table/issues/2890 Should be faster for large N – Ufos May 18 '18 at 12:18
2

Can data.table rows be removed in place, without assigning to a new data.table? (Somewhat similar to the `:=` tool?) – Kayle Sawyer Dec 10 '18 at 19:52

score 3 · Answer 2 · answered Aug 15 '20 at 07:05

try this:

data <- data[ !(menuitem == 'coffee' | amount <= 0),]

Generally:

dt <- data.table(a=c(1,1,1,2,2,2,3,3,3),b=c(4,2,3,1,5,3,4,7,6))
dt
#>    a b
#> 1: 1 4
#> 2: 1 2
#> 3: 1 3
#> 4: 2 1
#> 5: 2 5
#> 6: 2 3
#> 7: 3 4
#> 8: 3 7
#> 9: 3 6
dt[a!=1,]
#>    a b
#> 1: 2 1
#> 2: 2 5
#> 3: 2 3
#> 4: 3 4
#> 5: 3 7
#> 6: 3 6

Remove rows conditionally from a data.table in R

2 Answers2

Linked

Related