1

I have a data table with the structure as given below:

structure(list(GVKEY1 = c(2721, 113609, 62634, NA, 62599, 15855, 
15855, NA, NA, NA), GVKEY2 = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), GVKEY3 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), GVKEY4 = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_), GVKEY5 = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
)), .Names = c("GVKEY1", "GVKEY2", "GVKEY3", "GVKEY4", "GVKEY5"
), class = c("data.table", "data.frame"), row.names = c(NA, -10L
))

I want to create a new column which is the maximum value of all the five columns. Notice that the data has a lot of NAs.

I wrote the following line

patent <- patent[, GVKEY := lapply(.SD, max, na.rm = TRUE), .SDcols = c('GVKEY1', 'GVKEY2', 'GVKEY3', 'GVKEY4', 'GVKEY5')]

I get the following output.

Warning messages:
1: In [.data.table(patent, , :=(GVKEY, lapply(.SD, max, na.rm = TRUE)), :
Supplied 5 items to be assigned to 3280338 items of column 'GVKEY' (recycled leaving remainder of 3 items).
2: In [.data.table(patent, , :=(GVKEY, lapply(.SD, max, na.rm = TRUE)), :
Coerced 'list' RHS to 'double' to match the column's type. Either change the target column to 'list' first (by creating a new 'list' vector length 3280338 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'double' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.

Not sure what I am doing wrong. It would be great if someone can help me.

nicola
  • 24,005
  • 3
  • 35
  • 56
Sumit
  • 2,242
  • 4
  • 25
  • 43

2 Answers2

5

You can use the vectorized pmax function combined with do.call in order to avoid by row operation. This is uses the fact that data.table and data.frame are both lists (check out is.list(patent)), compared to apply which converts the data.table to a matrix. Also, as mentioned in the comments, the := updates by reference so please don't use it combined with the <- operator

patent[, GVKEY := do.call(pmax, c(na.rm = TRUE, .SD))]
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • very nice. I always forget about `pmax`, and I use `max( )` quite a lot – SymbolixAU Feb 28 '16 at 08:50
  • Can you please explain, how to write this function if I don't want to use all the five columns to calculate max. Let say max of GVKEY1, GVKEY2, and GVKEY3. Thanks. – Sumit Feb 29 '16 at 00:54
  • Add `.SDcols = c('GVKEY1', 'GVKEY2', 'GVKEY3', 'GVKEY4', 'GVKEY5')` like you already doing. Or you can do it shortly using `.SDcols = paste0('GVKEY', 1:5)` – David Arenburg Feb 29 '16 at 07:46
1

A slight change for you, apply instead of lapply, and when using :=, you don't use the <- assignment operator*.

patent[, GVKEY := apply(.SD, 1, max, na.rm = TRUE), .SDcols = c('GVKEY1', 'GVKEY2', 'GVKEY3', 'GVKEY4', 'GVKEY5')]

*References:

Community
  • 1
  • 1
SymbolixAU
  • 25,502
  • 4
  • 67
  • 139