1

I am looking to create a new data.table that contains all rows with at least one negative value.

Here is a simple reproducible datatable:

dt <- data.table(
  ID = c(42, 43, 44),
  Stage_1 = c(-6, 7, 4),
  Stage_2 = c(-15, 4, -8),
  Stage_3 = c(-20, 2, -5)
)

#    ID Stage_1 Stage_2 Stage_3
# 1: 42      -6     -15     -20 # <~~ row to be selected (> 0 negative values)
# 2: 43       7       4       2
# 3: 44       4      -8      -5 # <~~ row to be selected (> 0 negative values)

My desired output would be:

dt2 <- data.table(
  ID = c(42, 44),
  Stage_1 = c(-6, 4),
  Stage_2 = c(-15, -8),
  Stage_3 = c(-20, -5)
)

#    ID Stage_1 Stage_2 Stage_3
# 1: 42      -6     -15     -20
# 2: 44       4      -8      -5

ID 44 for example, has two negative values but I would like to include all of their rows from the main datatable. Basically all rows with a negative value in any of their columns I would like to add to a new datatable that contains all their information.

The actual dataset I'm working with has ~50 stage columns, so the most efficient solution is what I'm after.

Henrik
  • 65,555
  • 14
  • 143
  • 159
winnieeliz
  • 51
  • 1
  • 7
  • Another possibility: `dt[dt[ , do.call(pmin, .SD) < 0, .SDcols = -1]]`, the idea described e.g. [here](https://stackoverflow.com/questions/7885147/efficient-row-wise-operations-on-a-data-table/34019591#comment18073746_7887030) – Henrik Feb 19 '21 at 16:42
  • There's also `Reduce("|", lapply(.SD,` as described e.g. [here](https://stackoverflow.com/a/37567937/1851712) – Henrik Feb 19 '21 at 16:50

4 Answers4

4
dt <- data.table::data.table(
  ID = c(42, 43, 44),
  Stage_1 = c(-6, 7, 4),
  Stage_2 = c(-15, 4, -8),
  Stage_3 = c(-20, 2, -5)
)

dt
#>    ID Stage_1 Stage_2 Stage_3
#> 1: 42      -6     -15     -20
#> 2: 43       7       4       2
#> 3: 44       4      -8      -5
  • with apply() function:
dt[apply(dt[, -'ID'], 1, min) < 0, ]
#>    ID Stage_1 Stage_2 Stage_3
#> 1: 42      -6     -15     -20
#> 2: 44       4      -8      -5
  • with rowMeans() function based on the fact that average of a boolean vector with at least one true value is always greater than zero (thanks to @utubun):
dt[rowMeans(dt[, -'ID'] < 0) > 0, ]
#>    ID Stage_1 Stage_2 Stage_3
#> 1: 42      -6     -15     -20
#> 2: 44       4      -8      -5
  • and also with the rowMins() function of the fBasics package:
dt[fBasics::rowMins(dt[, -'ID']) < 0, ]
#>    ID Stage_1 Stage_2 Stage_3
#> 1: 42      -6     -15     -20
#> 2: 44       4      -8      -5

# Created on 2021-02-19 by the reprex package (v0.3.0.9001)

(Related to Equivalent to rowMeans() for min())

Regards,

barboulotte
  • 395
  • 2
  • 8
  • 1
    This is a really efficient way to execute the code! To avoid changing the code if the amount of columns change in the future I would add `nrow(dt)` to the function. According to this: `dt[apply(dt[,2:nrow(dt)], 1, min) < 0,]` – Albin Feb 19 '21 at 16:41
  • Thank you for the improvement. I modify the message. – barboulotte Feb 19 '21 at 16:52
  • 2
    Is it me, or actually you meant `dt[, 2:ncol(dt)]`? It works because it is squared data, but otherwise it won't work properly. – utubun Feb 19 '21 at 17:59
  • 1
    Oh yes, indeed, thank you very much. I correct. – barboulotte Feb 19 '21 at 18:00
  • 1
    Yes, sorry, my bad. @utubun thanks for checking – Albin Feb 19 '21 at 18:06
  • 1
    Yes, instead of `apply()` you can use `rowMeans()` just based on the fact, that average of a boolean vector with at least one true value is always greater than zero, which results in `dt[rowMeans(dt[, -'ID'] < 0) > 0, ]` – utubun Feb 19 '21 at 18:17
  • 1
    Thank you for the solution with rowMeans(). I added it to the message. I kept also the solution with apply() because I think it can help to understand. I also replaced `2:ncol(dt)` by `-'ID'` in the solution with apply. – barboulotte Feb 19 '21 at 18:43
  • I updated the response with fBasics::rowMins() function. – barboulotte Feb 19 '21 at 23:57
2
dt[Stage_1 < 0 | Stage_2 < 0 | Stage_3 < 0]

#    ID Stage_1 Stage_2 Stage_3
# 1: 42      -6     -15     -20
# 2: 44       4      -8      -5

Edit after OP's clarification:

With many columns:

# Find all the rows with at least one negative in each column
rowsDT <- dt[, lapply(.SD, function(x) which(x < 0)), .SDcols = -'ID']
# Reduce to a vector by applying union
rows <- Reduce(union, rowsDT)
# Extract from the main data.table
dt[rows]
Drumy
  • 450
  • 2
  • 16
  • Thank you! I should've included this in my original question, but I'd like to avoid this kind of syntax because the actual dataset I'm working with has about ~50 stage columns – winnieeliz Feb 19 '21 at 15:59
  • Roger that. I updated the answer accordingly. Not sure if this is the most efficient, but it should work relatively fast. – Drumy Feb 19 '21 at 16:23
  • Thank you! Worked perfectly – winnieeliz Feb 19 '21 at 16:29
1

Don't know if it is really good style, but as a shorthand you can use this:

dt[ dt[, any(.SD < 0), by = ID][['V1']], ]

#    ID Stage_1 Stage_2 Stage_3
# 1: 42      -6     -15     -20
# 2: 44       4      -8      -5

¹However, if ID is not unique, it can produce unexpected results.

utubun
  • 4,400
  • 1
  • 14
  • 17
  • 1
    `ID` could be replaced by `list(rownames(dt))` to have unique identifier. – barboulotte Feb 19 '21 at 17:33
  • Wanted to keep it short. But yes, since there's no need of `ID` in subsetting call, there is a way around. I wonder how profiling would look like for all of the solutions. – utubun Feb 19 '21 at 17:49
0

A data.table option using sign + rowSums

> dt[, .SD[rowSums(sign(.SD)) < length(.SD)], ID]
   ID Stage_1 Stage_2 Stage_3
1: 42      -6     -15     -20
2: 44       4      -8      -5
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81