-1

I know that there are more questions like this. However, I can not get it done. I have a very large data.table with about 800.000 rows. Every row contains some specific data, then some demand data for 2 years (from column 12 onward), and in the last column is a number. This number determines how many demand columns should be replaced.

Example with one row:

ITEM COUNTRY Q1 Q2 Q3 Q4 ---- Q24 NUMBER 
1          1  0  0  0  0 ----   2      2

If this is a vector called x for example, I would do this:

x[2:(2+x$NUMBER-1)] <- NA 

Now I want to do this for all rows. A for loop is way too slow. So I was thinking of apply.

fun_NA <- function(x){ #x is then a row of the data.table
#still use that with x[,] as x is still a data.table and not a vector 

    if(x[,60]>0){
        x[,12:(12+as.numeric(x[,60])-1)] <- NA  
    }
}

dt = apply(dt, 1, fun_NA)

where dtis my data.table, which has 60 columns. The first demand value starts in column 12. It does not work.. Now it gives the error "Error in x[,60] : incorrect number of dimensions", but I tried other ways too and received other errors.

Reason why I want to do this: The number represent the month (after first month of my data set) when the item was able to sell, so in the months before, the demand was not 0, but just did not exist. I need it to be NA, as 0 will cause wrong calculations later on.

EDIT: Removed the comma's so this is the new code,

fun_NA <- function(x){ #x is then a row of the data.table

        if(x[60]>0){
            x[12:(12+as.numeric(x[60])-1)] <- NA    
        }
    }

dt = apply(dt, 1, fun_NA)

However, this returns a large list with all NULL and NA elements..

Edit: the head of the data-table is as follows: (dput)

structure(list(ITEM = c(1, 1, 2, 2, 2, 2), COUNTRY = c(1, 2, 
3, 4, 5, 2), DATE = c("2015-02-02", "2015-02-02", "2014-09-27", 
"2014-09-27", "2014-09-27", "2014-09-27"), q_1 = c(0, 0, 2, 0, 
0, 133), q_2 = c(0, 0, 24, 0, 9, 119), q_3 = c(0, 0, 15, 0, 13, 
121), q_4 = c(0, 0, 7, 0, 2, 51), q_5 = c(0, 0, 12, 0, 6, 59), 
    q_6 = c(0, 0, 3, 0, 0, 36), q_7 = c(0, 0, 6, 0, 6, 41), q_8 = c(0, 
    0, 19, 0, 4, 42), q_9 = c(0, 0, 3, 0, 5, 48), q_10 = c(0, 
    0, 5, 0, 11, 49), q_11 = c(0, 0, 6, 0, 1, 42), q_12 = c(0, 
    0, 0, 0, 8, 70), q_13 = c(0, 0, 1, 0, 19, 81), q_14 = c(0, 
    0, 5, 0, 98, 86), q_15 = c(0, 0, 12, 0, 10, 152), q_16 = c(0, 
    0, 7, 0, 8, 95), q_17 = c(0, 0, 30, 0, 5, 62), q_18 = c(0, 
    0, 6, 0, 10, 47), q_19 = c(0, 0, 7, 0, 1, 35), q_20 = c(2, 
    0, 7, 0, 0, 47), q_21 = c(0, 2, 16, 5, 4, 70), q_22 = c(0, 
    0, 7, 0, 7, 46), q_23 = c(0, 0, 8, 0, 79, 20), q_24 = c(0, 
    0, 5, 0, 26, 45), NUMBER = c(13, 13, 8, 8, 8, 8)), .Names = c("ITEM", 
"COUNTRY", "DATE", "q_1", "q_2", "q_3", "q_4", "q_5", "q_6", 
"q_7", "q_8", "q_9", "q_10", "q_11", "q_12", "q_13", "q_14", 
"q_15", "q_16", "q_17", "q_18", "q_19", "q_20", "q_21", "q_22", 
"q_23", "q_24", "NUMBER"), class = c("data.table", "data.frame"
), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000004490788>)
pk_22
  • 288
  • 1
  • 2
  • 18
  • Please read and follow basic posting guidelines https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 – Frank Sep 20 '17 at 15:24

2 Answers2

1

Usually, it is more efficient to work column-wise.

lapply(1:24, function(i) dt[i <= NUMBER, (paste0("q_", i)) := NA])
   ITEM COUNTRY       DATE q_1 q_2 q_3 q_4 q_5 q_6 q_7 q_8 q_9 q_10 q_11 q_12 q_13 q_14 q_15 q_16 q_17 q_18 q_19 q_20
1:    1       1 2015-02-02  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA    0    0    0    0    0    0    2
2:    1       2 2015-02-02  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA    0    0    0    0    0    0    0
3:    2       3 2014-09-27  NA  NA  NA  NA  NA  NA  NA  NA   3    5    6    0    1    5   12    7   30    6    7    7
4:    2       4 2014-09-27  NA  NA  NA  NA  NA  NA  NA  NA   0    0    0    0    0    0    0    0    0    0    0    0
5:    2       5 2014-09-27  NA  NA  NA  NA  NA  NA  NA  NA   5   11    1    8   19   98   10    8    5   10    1    0
6:    2       2 2014-09-27  NA  NA  NA  NA  NA  NA  NA  NA  48   49   42   70   81   86  152   95   62   47   35   47
   q_21 q_22 q_23 q_24 NUMBER
1:    0    0    0    0     13
2:    2    0    0    0     13
3:   16    7    8    5      8
4:    5    0    0    0      8
5:    4    7   79   26      8
6:   70   46   20   45      8

Explanation

We loop over all 24 columns. For each row it is tested whether the value in the current column needs to be replaced by NA according to its column number. The values are updated in place which saves time and memory.

I've tested the solution with a sample data set of 1 million rows which took less than 0.2 seconds.

Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
0

Since your input to the function is a vector, you should get rid of all the commas, because it only has one dimension.

fun_NA <- function(x){ #x is then a row of the data.table
#still use that with x[,] as x is still a data.table and not a vector 

    if(x[60]>0){
        x[12:(12+as.numeric(x[60])-1)] <- NA  
    }
}

dt = apply(dt, 1, fun_NA)
csgroen
  • 2,511
  • 11
  • 28
  • Yeah I had that first, but that does not work as I try it with the first row as an example. I tried that with x=dt[1,], and then class(x) results in "data.table" "data.frame", and I have to call the second element of x by x[,2] – pk_22 Sep 20 '17 at 14:57
  • see also my comment in the function. – pk_22 Sep 20 '17 at 14:57
  • Okay, I was wrong here. In the apply it apparently needs to be without a comma. It now takes a few second to terminate, without any error. However, than I get a very large list with only NULL, so it does not work properly ... – pk_22 Sep 20 '17 at 15:01
  • Hm, I think the problem might be in the function then... Can you dput(head(x)) so I can test it? – csgroen Sep 20 '17 at 15:20
  • I added the head of a reduced version of my data.table (so lesser columns and the function fun_NA should start at column 4 with replacing now, instead of 12, based on the number in the last column. – pk_22 Sep 21 '17 at 07:37