Update some columns conditionally with max value for all values in chosen columns (data.table, r)

Question

I have a dataset of the type 900,000 x 500, but the following shows a toy dataset apt for the question.

library(data.table)
df1 <- data.table(x = c(1,2,4,0), y = c(0,0,10,15), z = c(1,1,1,0))

I would like to do the following:

For columns y and z
select rows the value for which = 0
replace these with the max+1, where max is computed over the entire column

I am very new to data.table. Looking at examples of questions here at stackoverflow, I couldn't find a similar question, except this: How to replace NA values in a table *for selected columns*? data.frame, data.table

My own attempt is as follows, but this does not work:

for (col in c("x", "y")) df1[(get(col)) == 0, (col) := max(col) + 1)

Obviously, I haven't gotten accustomed to data.table, so I'm banging my head against the wall at the moment...

If anybody could provide a dplyr solution in addition to data.table, I would be thankful.

akrun · Accepted Answer · 2016-01-14T13:47:58.723

We can use set and assign the rows where the value is 0 with the max of that column +1.

 for(j in c("y", "z")){
    set(df1, i= which(!df1[[j]]), j=j, value= max(df1[[j]])+1)
 }

df1
#   x  y z
#1: 1 16 1
#2: 2 16 1
#3: 4 10 1
#4: 0 15 2

NOTE: The set method will be very efficient as the overhead of [.data.table is avoided

Or a less efficient method would be to specify the columns of interest in .SDcols, loop through the columns (lapply(..), replace the value based on the logical index, and assign (:=) the output back to the columns.

df1[, c('y', 'z') := lapply(.SD, function(x) 
         replace(x, !x, max(x)+1)), .SDcols= y:z]

thanks again for the help! I was trying to use `set` too, but couldn't help myself around its use. I'll accept this answer. — info_seekeR, Jan 14 '16 at 13:47

Stephen Henderson · Answer 2 · 2016-01-14T14:59:18.063

2

The dplyr version is pretty simple (I think)

> library(dplyr)
# indented for clarity
> mutate(df1, 
    y= ifelse(y>0, y, max(y)+1), 
    z= ifelse(z>0, z, max(z)+1))

  x  y z
1 1 16 1
2 2 16 1
3 4 10 1
4 0 15 2

EDIT As noted by David Arenburg in comments this is helpful for the toy example but not for the data mentione dwith 500 columns. He suggests something similar to:

df1 %>% mutate_each(funs(ifelse(. > 0, ., max(.) + 1)), -1)

where -1 specifies all but the first column

edited Jan 14 '16 at 14:59

answered Jan 14 '16 at 14:42

Stephen Henderson

6,340
3
27
33

1

Actually No. OP has 500 columns. "The `dplyr` way" would be something like `dt %>% mutate_each(funs(ifelse(. > 0, ., max(.) + 1)), y:z)` or similar. – David Arenburg Jan 14 '16 at 14:48
OK, that works too. I'll add it to answer with ref to you. – Stephen Henderson Jan 14 '16 at 14:53

score 1 · Answer 3 · answered Jan 14 '16 at 14:01

1

As an alternative, ifelse(test, yes, no) might be useful

Along the lines

library(data.table)
dt <- data.table(x = c(1,2,4,0), y = c(0,0,10,15), z = c(1,1,1,0))

print(dt)

dt[, y := ifelse(!y, max(y) + 1, y)]

print(dt)

answered Jan 14 '16 at 14:01

Severin Pappadeux

18,636
3
38
64

thanks. However, since I have almost 500 columns, it would become difficult to use ifelse for each column. Could `lapply` be used here for an easier `ifelse`? – info_seekeR Jan 14 '16 at 14:05
2

@info_seekeR Yes, `lapply` can be used here in the same way as in akrun's post. Because you say you have large data, you might want to consider this post that says `ifelse` can be slow (I'm not sure it applies to your case): http://stackoverflow.com/a/16275201/1191259 – Frank Jan 14 '16 at 14:08
1

@info_seekeR how to loop over columns is orthogonal to the question what to do with each column, so as @Frank said, `lapply` could be used in the same way as in @arkun's answer. Wrt being slow or not, you have ot measure. If speed is of an issue, first thing to do would be, IMHO, to compute `max(c)+1` separately, outside of the either `replace` or `ifelse` – Severin Pappadeux Jan 14 '16 at 14:23
2

If you already going this rout you could just do `dt[y == 0L, y := max(dt$y) + 1L]` in order to avoid `ifelse` all together (which adds nothing new to `replace` IMO). – David Arenburg Jan 14 '16 at 14:43

Update some columns conditionally with max value for all values in chosen columns (data.table, r)

3 Answers3