Removing columns from a data.table in R based on conditions

Question

How do I remove columns based on their values in a data.table in R?

If I have a data.table row,

dt = data.table("col1" = "a", "col2" = "b", "col3" = "c", 
"col4" = 'd', "col5" = "e", "col6" = 9, "col7" = 0, "col8" = 7,
"col9" = 0, "col10" = 99)

The first 5 columns are categorical, and columns 6-10 are numerical. In the numerical columns, the same number is repeated in every row.

I have two questions:

1. How do I remove the columns containing 0s? Which columns are 0 can vary with the input, i.e. sometimes col7 might be 0, sometimes col8, etc.
2. Once I remove the columns with 0 values, how do I concatenate the remaining numbers into a single column? In this case, the new column would contain the number 9799.

Is there a way to do this without removing the 0 value columns?

For the first part, I tried

cols_chosen = c("col6", "col7","col8","col9","col10")

condition = c(FALSE, dt[, lapply(.SD, function(x) sum(x)< 1), .SDcols = cols_chosen])

dt[, which(condition) := NULL]

While I am getting the correct value for the conditions (a list of 5 logical values), the last command is failing with the error

Error in which(condition) : argument to 'which' is not logical

I took the above statements from an earlier answer, "Remove columns of dataframe based on conditions in R".
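A possible explanation for the error, offered as a sketch rather than a definitive diagnosis: `c(FALSE, <data.table>)` returns a list, and `which()` only accepts a logical vector. Also, the single leading `FALSE` shifts positions by one, while this table has five leading categorical columns, so positional indexes would not line up even after `unlist()`. The example below sidesteps both issues by selecting columns by name. The names `is_zero` and `zero_cols` are illustrative and not from the original post, and `all(x == 0)` is an assumed stand-in for the `sum(x) < 1` test.

```r
library(data.table)

dt <- data.table(col1 = "a", col2 = "b", col3 = "c", col4 = "d", col5 = "e",
                 col6 = 9, col7 = 0, col8 = 7, col9 = 0, col10 = 99)
cols_chosen <- c("col6", "col7", "col8", "col9", "col10")

# Reproduce the problem: combining FALSE with a data.table gives a list
condition <- c(FALSE, dt[, lapply(.SD, function(x) sum(x) < 1), .SDcols = cols_chosen])
is.list(condition)  # TRUE, which is why which(condition) fails

# One way around it: build a plain logical vector over the chosen columns
# (assumption: a column should be dropped when all of its values are 0)
is_zero <- vapply(dt[, cols_chosen, with = FALSE],
                  function(x) all(x == 0), logical(1))

# Select by name, so the position of the categorical columns does not matter
zero_cols <- cols_chosen[is_zero]
if (length(zero_cols) > 0) dt[, (zero_cols) := NULL]

names(dt)
```

Selecting by name keeps the code independent of how many categorical columns precede the numeric ones.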

score 2 · Accepted Answer · jangorecki · 2020-11-23T16:55:14.817

dt = data.table("col1" = "a", "col2" = "b", "col3" = "c", 
"col4" = 'd', "col5" = "e", "col6" = 9, "col7" = 0, "col8" = 7,
"col9" = 0, "col10" = 99)

not0 = function(x) is.numeric(x) && !anyNA(x) && all(x!=0)
dt[, .(
    ## your categorical columns
    col1, col2, col3, col4, col5,
    ## new column pasted from non-0 numeric columns
    new = as.numeric(paste0(unlist(.SD), collapse=""))
  ),
  ## this filters columns to be provided in .SD column subset
  .SDcols = not0,
  ## we group by each row so it will handle input of multiple rows
  by = .(row=seq_len(nrow(dt)))
  ][, row:=NULL ## this removes extra grouping column
    ][] ## this prints
#   col1 col2 col3 col4 col5  new
#1:    a    b    c    d    e 9799

Alternatively, if you want to update the existing table in place:

is0 = function(x) is.numeric(x) && !anyNA(x) && all(x==0)
## remove numeric columns that contain only 0
dt[, which(sapply(dt, is0)) := NULL]

## add new column
dt[, new := as.numeric(
    paste0(unlist(.SD), collapse="")
  ), .SDcols=is.numeric, by=.(row=seq_len(nrow(dt)))
  ][]
#   col1 col2 col3 col4 col5 col6 col8 col10  new
#1:    a    b    c    d    e    9    7    99 9799

edited Nov 23 '20 at 16:55

answered Nov 23 '20 at 16:01

jangorecki


Thank you for the solution, but how do I split this into two steps: first removing just the 0 columns, then creating the new column? The code works on the sample data I provided exactly as you described, but when I try it on my dataset, the dt call also pastes the 0 columns together with the non-zero columns in the 'new' column. – user3612324 Nov 23 '20 at 16:44

Maybe you have integer columns, unlike in your example dataset? – jangorecki Nov 23 '20 at 16:53

I updated the `not0` function; it should now work for integer fields and multiple rows as well. Please try it, @user3612324. – jangorecki Nov 23 '20 at 16:55

@user3612324 I added another example of exactly what you asked, first removing 0-columns, then creating new column. – jangorecki Nov 23 '20 at 17:00
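To illustrate the two-step version discussed in these comments, here is a minimal sketch on a hypothetical two-row table with integer columns (the data is made up for illustration and is not the asker's dataset). It reuses `is0` from the answer; `zero_cols` and `num_cols` are illustrative names. Instead of grouping by row, it pastes the columns with a vectorized `paste0` call, which is a swapped-in technique and not the approach from the answer.

```r
library(data.table)

# Hypothetical multi-row input with integer columns (not from the question)
dt <- data.table(col1 = c("a", "b"),
                 col6 = c(9L, 9L), col7 = c(0L, 0L), col8 = c(7L, 7L),
                 col9 = c(0L, 0L), col10 = c(99L, 99L))

# TRUE for numeric (integer or double) columns without NA that are all 0
is0 <- function(x) is.numeric(x) && !anyNA(x) && all(x == 0)

# Step 1: drop the all-zero columns, selected by name
zero_cols <- names(dt)[vapply(dt, is0, logical(1))]
if (length(zero_cols) > 0) dt[, (zero_cols) := NULL]

# Step 2: paste the remaining numeric columns together, row by row.
# unname() keeps a column name from being matched to a paste0() argument.
num_cols <- names(dt)[vapply(dt, is.numeric, logical(1))]
dt[, new := as.numeric(do.call(paste0, unname(as.list(.SD)))), .SDcols = num_cols]

dt[]
```

Because `is.numeric()` is TRUE for both integer and double columns, the same check covers either storage type.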

score 1 · Answer 2 · edited Nov 24 '20 at 15:31

To remove the columns containing zeros (assuming the numbers are repeated in every row, as you say), it is enough to check the first row for elements equal to 0 and keep the columns that aren't:

dt <- data.frame("col1" = "a", "col2" = "b", "col3" = "c", 
"col4" = 'd', "col5" = "e", "col6" = 9, "col7" = 0, "col8" = 7,
"col9" = 0, "col10" = 99)
dt <- dt[,dt[1,] != 0]

This leaves us with dt as:

  col1 col2 col3 col4 col5 col6 col8 col10
1    a    b    c    d    e    9    7    99

To concatenate the remaining numeric columns (assuming they are all integers) into a new column, you can use lapply on the first row to get a logical vector indicating the columns with numbers. Then you can convert them to strings and paste them together in the new column.

numTag <- unlist(lapply(X = dt[1,], FUN = is.numeric))
dt$new_col <- rep(as.numeric(paste(as.character(dt[1,numTag]), collapse = '', sep = '')), nrow(dt))

Now dt looks like:

  col1 col2 col3 col4 col5 col6 col8 col10 new_col
1    a    b    c    d    e    9    7    99    9799

To do this without removing the zero-valued columns, the only necessary twist is to filter out the zeroes from our initial logical vector:

numTag <- unlist(lapply(X = dt[1,], FUN = is.numeric))
numTag <- numTag & (dt[1,] != 0)

dt$new_col <- rep(as.numeric(paste(as.character(dt[1,numTag]), collapse = '', sep = '')), nrow(dt))
dt

  col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 new_col
1    a    b    c    d    e    9    0    7    0    99    9799

Thank you for the solution. This seems to work great when the data is in a dataframe. However I am having difficulties getting it to work with a data.table — user3612324, Nov 23 '20 at 16:46
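One way the first-row idea might be adapted to a data.table is sketched below. A data.table treats the `j` argument as an expression to evaluate rather than as a plain column index, which is a likely reason the data.frame code does not carry over directly. The sketch assumes, as the answer does, that each numeric column repeats the same value in every row and contains no NA. The names `keep`, `num_cols` and `first_vals` are illustrative.

```r
library(data.table)

dt <- data.table(col1 = "a", col2 = "b", col3 = "c", col4 = "d", col5 = "e",
                 col6 = 9, col7 = 0, col8 = 7, col9 = 0, col10 = 99)

# Keep non-numeric columns, and numeric columns whose first value is not 0
keep <- vapply(dt, function(x) !is.numeric(x) || x[1] != 0, logical(1))
dt <- dt[, names(dt)[keep], with = FALSE]

# Paste the first-row values of the remaining numeric columns together
num_cols <- names(dt)[vapply(dt, is.numeric, logical(1))]
first_vals <- unlist(dt[1, num_cols, with = FALSE])

# The single pasted value is recycled to every row
dt[, new_col := as.numeric(paste(first_vals, collapse = ""))]

dt[]
```

Selecting with `with = FALSE` and a character vector of names keeps the column subsetting explicit.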

score 1 · Answer 3 · answered Nov 23 '20 at 16:06

library(data.table)
library(dplyr)
library(tidyr)

dt = data.table("col1" = "a", "col2" = "b", "col3" = "c", 
                "col4" = 'd', "col5" = "e", "col6" = 9, "col7" = 0, "col8" = 7,
                "col9" = 0, "col10" = 99)


## which columns contain only zeros?
zero_vars <-  dt %>% 
  dplyr::select_if(~max(.x) == 0) %>% 
  colnames()


## which columns are non-zero numeric vars?
numeric_vars <- dt %>% 
  dplyr::select(-all_of(zero_vars)) %>% 
  dplyr::select_if(is.numeric) %>% 
  colnames()
                  

## create new table
collapsed_dt <- 
  dt %>% 
  dplyr::select(all_of(numeric_vars)) %>%   ## select only non-zero numeric vars
  mutate_all(as.character) %>% 
  unite( col = "collapsed_var", sep = "") ## unite them to new var 'collapsed_var'


## re-join the collapsed var to the original table
dt %>% 
  dplyr::select_if(is.character) %>% ## only character variables
  cbind(collapsed_dt) ## bind the collapsed_dt
