Select rows with at least one (any) negative value

Question

I am looking to create a new data.table that contains all rows with at least one negative value.

Here is a simple reproducible datatable:

dt <- data.table(
  ID = c(42, 43, 44),
  Stage_1 = c(-6, 7, 4),
  Stage_2 = c(-15, 4, -8),
  Stage_3 = c(-20, 2, -5)
)

#    ID Stage_1 Stage_2 Stage_3
# 1: 42      -6     -15     -20 # <~~ row to be selected (> 0 negative values)
# 2: 43       7       4       2
# 3: 44       4      -8      -5 # <~~ row to be selected (> 0 negative values)

My desired output would be:

dt2 <- data.table(
  ID = c(42, 44),
  Stage_1 = c(-6, 4),
  Stage_2 = c(-15, -8),
  Stage_3 = c(-20, -5)
)

#    ID Stage_1 Stage_2 Stage_3
# 1: 42      -6     -15     -20
# 2: 44       4      -8      -5

ID 44 for example, has two negative values but I would like to include all of their rows from the main datatable. Basically all rows with a negative value in any of their columns I would like to add to a new datatable that contains all their information.

The actual dataset I'm working with has ~50 stage columns, so the most efficient solution is what I'm after.

Another possibility: `dt[dt[ , do.call(pmin, .SD) < 0, .SDcols = -1]]`, the idea described e.g. [here](https://stackoverflow.com/questions/7885147/efficient-row-wise-operations-on-a-data-table/34019591#comment18073746_7887030) — Henrik, Feb 19 '21 at 16:42
There's also `Reduce("|", lapply(.SD,` as described e.g. [here](https://stackoverflow.com/a/37567937/1851712) — Henrik, Feb 19 '21 at 16:50

barboulotte · Answer 1 · 2021-02-19T23:55:09.800

4

dt <- data.table::data.table(
  ID = c(42, 43, 44),
  Stage_1 = c(-6, 7, 4),
  Stage_2 = c(-15, 4, -8),
  Stage_3 = c(-20, 2, -5)
)

dt
#>    ID Stage_1 Stage_2 Stage_3
#> 1: 42      -6     -15     -20
#> 2: 43       7       4       2
#> 3: 44       4      -8      -5

with apply() function:

dt[apply(dt[, -'ID'], 1, min) < 0, ]
#>    ID Stage_1 Stage_2 Stage_3
#> 1: 42      -6     -15     -20
#> 2: 44       4      -8      -5

with rowMeans() function based on the fact that average of a boolean vector with at least one true value is always greater than zero (thanks to @utubun):

dt[rowMeans(dt[, -'ID'] < 0) > 0, ]
#>    ID Stage_1 Stage_2 Stage_3
#> 1: 42      -6     -15     -20
#> 2: 44       4      -8      -5

and also with the rowMins() function of the fBasics package:

dt[fBasics::rowMins(dt[, -'ID']) < 0, ]
#>    ID Stage_1 Stage_2 Stage_3
#> 1: 42      -6     -15     -20
#> 2: 44       4      -8      -5

# Created on 2021-02-19 by the reprex package (v0.3.0.9001)

(Related to Equivalent to rowMeans() for min())

Regards,

edited Feb 19 '21 at 23:55

answered Feb 19 '21 at 16:28

barboulotte

395
2
8

1

This is a really efficient way to execute the code! To avoid changing the code if the amount of columns change in the future I would add `nrow(dt)` to the function. According to this: `dt[apply(dt[,2:nrow(dt)], 1, min) < 0,]` – Albin Feb 19 '21 at 16:41
Thank you for the improvement. I modify the message. – barboulotte Feb 19 '21 at 16:52
2

Is it me, or actually you meant `dt[, 2:ncol(dt)]`? It works because it is squared data, but otherwise it won't work properly. – utubun Feb 19 '21 at 17:59
1

Oh yes, indeed, thank you very much. I correct. – barboulotte Feb 19 '21 at 18:00
1

Yes, sorry, my bad. @utubun thanks for checking – Albin Feb 19 '21 at 18:06
1

Yes, instead of `apply()` you can use `rowMeans()` just based on the fact, that average of a boolean vector with at least one true value is always greater than zero, which results in `dt[rowMeans(dt[, -'ID'] < 0) > 0, ]` – utubun Feb 19 '21 at 18:17
1

Thank you for the solution with rowMeans(). I added it to the message. I kept also the solution with apply() because I think it can help to understand. I also replaced `2:ncol(dt)` by `-'ID'` in the solution with apply. – barboulotte Feb 19 '21 at 18:43
I updated the response with fBasics::rowMins() function. – barboulotte Feb 19 '21 at 23:57

Drumy · Accepted Answer · 2021-02-19T16:22:21.033

2

dt[Stage_1 < 0 | Stage_2 < 0 | Stage_3 < 0]

#    ID Stage_1 Stage_2 Stage_3
# 1: 42      -6     -15     -20
# 2: 44       4      -8      -5

Edit after OP's clarification:

With many columns:

# Find all the rows with at least one negative in each column
rowsDT <- dt[, lapply(.SD, function(x) which(x < 0)), .SDcols = -'ID']
# Reduce to a vector by applying union
rows <- Reduce(union, rowsDT)
# Extract from the main data.table
dt[rows]

edited Feb 19 '21 at 16:22

answered Feb 19 '21 at 15:57

Drumy

450
2
16

Thank you! I should've included this in my original question, but I'd like to avoid this kind of syntax because the actual dataset I'm working with has about ~50 stage columns – winnieeliz Feb 19 '21 at 15:59
Roger that. I updated the answer accordingly. Not sure if this is the most efficient, but it should work relatively fast. – Drumy Feb 19 '21 at 16:23
Thank you! Worked perfectly – winnieeliz Feb 19 '21 at 16:29

utubun · Answer 3 · 2021-02-19T17:18:16.087

1

Don't know if it is really good style, but as a shorthand you can use this:

dt[ dt[, any(.SD < 0), by = ID][['V1']], ]

#    ID Stage_1 Stage_2 Stage_3
# 1: 42      -6     -15     -20
# 2: 44       4      -8      -5

¹However, if ID is not unique, it can produce unexpected results.

edited Feb 19 '21 at 17:18

answered Feb 19 '21 at 17:10

utubun

4,400
1
14
17

1

`ID` could be replaced by `list(rownames(dt))` to have unique identifier. – barboulotte Feb 19 '21 at 17:33
Wanted to keep it short. But yes, since there's no need of `ID` in subsetting call, there is a way around. I wonder how profiling would look like for all of the solutions. – utubun Feb 19 '21 at 17:49

score 0 · Answer 4 · answered Feb 19 '21 at 23:10

0

A data.table option using sign + rowSums

> dt[, .SD[rowSums(sign(.SD)) < length(.SD)], ID]
   ID Stage_1 Stage_2 Stage_3
1: 42      -6     -15     -20
2: 44       4      -8      -5

answered Feb 19 '21 at 23:10

ThomasIsCoding

96,636
9
24
81

Select rows with at least one (any) negative value

4 Answers4