Efficiently replacing negative values in large datasets conditional on type

Question

In my dataset:

# A tibble: 240 x 1,415
   matchcode S001  S002  S002EVS S003  S003A S004  S006  S007  S007_01    S008  S009  S009A S010  S010_01 S010_02 S010_03 S010_04 S011  S012  S013  S013B S014  S015  S016  S017      S017A    
   <fct>     <dbl> <dbl> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl>  <dbl> <fct> <fct> <dbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl>
 1 "JPN 198~ 2     1     -4      392   392   -4     324    324 3920120324 -4    JP    JP     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    0.6789805 0.6789805
 2 "MEX 198~ 2     1     -4      484   484   -4     933   2130 4840120926 -4    MX    MX     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.1378840 1.1378840
 3 "HUN 198~ 2     1     -4      348   348   -4    1280   4321 3480121280 -4    HU    HU     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0635516 1.0635516
 4 "AUS 198~ 2     1     -4       36    36   -4     973   5478  360120973 -4    AU    AU     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    0.9616138 0.9616138
 5 "ARG 198~ 2     1     -4       32    32   -4     874   6607  320120874 -4    AR    AR     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    0.9266260 0.9266260
 6 "FIN 198~ 2     1     -4      246   246   -4     385   7123 2460120385 -4    FI    FI     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0000000 1.0000000
 7 "KOR 198~ 2     1     -4      410   410   -4       3   7744 4100120003 -4    KR    KR     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0000000 1.0000000
 8 "ZAF 198~ 2     1     -4      710   710   -4    5420  10260 7100121549 -4    ZA    ZA     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0000000 1.0000000
 9 "ARG 199~ 2     2     -4       32    32   -4     856  11163  320240856 -4    AR    AR    125   -4      -4      -4      -4      1210  -4     1    -4    -4    -4    -4    1.0000000 1.0000000
10 "BLR 199~ 2     2     -4      112   112   -4     106  11415 1120240106 -4    BY    BY     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0000000 1.0000000

to replace all negative values with NA's, I used the following code:

df [ df < 0 ] <- NA

I however only want to have this operation carried out on columns that are not characters (I want to get rid of the error messages, without suppressing them). The variable charcol holds the names of the columns that should be skipped. I tried:

df [-charcol] df [-charcol] < 0] <- NA

Which gave me the error:

Error: cannot allocate vector of size 1.8 Gb

In addition to still giving me the warnings:

In addition: Warning messages:
1: In Ops.factor(left, right) : ‘<’ not meaningful for factors

Although I probably got the syntax wrong, I am wondering what would be the most efficient solution for such problems for large datasets. I have been looking at the data.table vignette for a while, but I cannot really figure out how to do the syntax.

Any suggestions?

str(WVSsample)
Classes ‘data.table’ and 'data.frame':  240 obs. of  1415 variables:
 $ matchcode  : Factor w/ 240 levels "ALB 1998 ","ALB 2002 ",..: 108 134 88 12 4 73 117 232 5 25 ...
 $ S001       :Class 'labelled'  atomic [1:240] 2 2 2 2 2 2 2 2 2 2 ...
  .. ..- attr(*, "label")= chr "Study"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:7] -5 -4 -3 -2 -1 1 2
  .. .. ..- attr(*, "names")= chr [1:7] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S002       :Class 'labelled'  atomic [1:240] 1 1 1 1 1 1 1 1 2 2 ...
  .. ..- attr(*, "label")= chr "Wave"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:11] -5 -4 -3 -2 -1 1 2 3 4 5 ...
  .. .. ..- attr(*, "names")= chr [1:11] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S002EVS    :Class 'labelled'  atomic [1:240] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
  .. ..- attr(*, "label")= chr "EVS-wave"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:9] -5 -4 -3 -2 -1 1 2 3 4
  .. .. ..- attr(*, "names")= chr [1:9] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S003       :Class 'labelled'  atomic [1:240] 392 484 348 36 32 246 410 710 32 112 ...
  .. ..- attr(*, "label")= chr "Country/region"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:199] -5 -4 -3 -2 -1 4 8 12 16 20 ...
  .. .. ..- attr(*, "names")= chr [1:199] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S003A      :Class 'labelled'  atomic [1:240] 392 484 348 36 32 246 410 710 32 112 ...
  .. ..- attr(*, "label")= chr "Country/regions [with split ups]"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:199] -5 -4 -3 -2 -1 4 8 12 16 20 ...
  .. .. ..- attr(*, "names")= chr [1:199] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S004       :Class 'labelled'  atomic [1:240] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
  .. ..- attr(*, "label")= chr "Set"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:7] -5 -4 -3 -2 -1 1 2
  .. .. ..- attr(*, "names")= chr [1:7] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...

EDIT: @chinsoon12 mentioned using the following piece of code:

f_dowle3 = function(DT) {
  for (j in seq_len(ncol(DT)))
    set(DT,which(is.na(DT[[j]])),j,0)
}

This code however does not do two things:

It replaces NA's with zero, while I want to replace negative values with NA's. I need to change the which(is.na(DT[[j]])) part to something likeDT[[j]]) < 0.
It does not account for character columns.

I changed the code to:

f_dowle3 = function(DT) {
  # or by number (slightly faster than by name) :
  for (j in seq_len(ncol(DT)))
    set(DT,which(DT[[j]]<0),j,NA)
}

But this makes the dataset NULL. Could anyone help me with adapting the code properly?

see https://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table, in a nutshell, use `data.table::set` function — chinsoon12, Sep 18 '18 at 09:06
@chinsoon12 I tried to change the piece of code to fit my problem (see EDIT), but I did not get very far. Do you have any idea how to adapt it? — Tom, Sep 18 '18 at 10:26

chinsoon12 · Accepted Answer · 2018-09-19T00:21:21.990

1

Since this is a dupe, will delete shortly as cannot fit in comments.

setDT(df)
cols <- names(df)[sapply(df, is.numeric)]
for (x in cols) {
    set(df, which(df[[x]] < 0), x, NA_real_)
}

edited Sep 19 '18 at 00:21

answered Sep 18 '18 at 10:32

chinsoon12

25,005
4
25
35

I am trying it now. Maybe it's better if I rewrite the question to which you gave the answer? – Tom Sep 18 '18 at 10:49
It really is a dupe. I just can’t seem to find the relevant one in google – chinsoon12 Sep 18 '18 at 10:57
Ah okay, I thought you meant a duplicate of the link you sent. In any case, thank you so much for helping out, I really appreciate it! – Tom Sep 18 '18 at 10:58
Small additional question. I assumed your code worked, because the result was a database of the same dimensions without giving an error. Weirdly enough, when I actually try to view the data, it gives me the error: `Error: cannot allocate vector of size 2.6 Mb`. The old database of the same dimensions works perfectly and only when I remove everything else in my workspace the problem goes away. Any idea why that could be? – Tom Sep 18 '18 at 12:34
What R architecture are you using? 32 or 64bit? How much RAM do you have? Are you able to reproduce the error consistently? If yes, you will need to share the data so that I can reproduce on my end – chinsoon12 Sep 18 '18 at 13:56
I am running RStudio on Windows (which shows 32-bit), but I believe that it still runs R in 64bit right? I have 16GB of RAM. `memory.limit() [1] 16201` – Tom Sep 18 '18 at 14:07
Use sessionInfo() to check. You can point to the 64bit R version from the RStudio settings – chinsoon12 Sep 18 '18 at 14:08
It's indeed running 64-bit. `R version 3.4.3 (2017-11-30) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200)` – Tom Sep 18 '18 at 14:12
Hmmm that is a out of memory error. You might need to profile your memory usage to identify the problem. – chinsoon12 Sep 18 '18 at 14:13
1

borrowed from [David arenburg's comment](https://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table#comment61739776_31527286): something like `df[,names(df) := lapply(.SD,function(x) { if (is.numeric(x) { x[x<0] <- NA_Real_ }; x })]` should do in one pass – Tensibai Sep 18 '18 at 14:51
1

Fyi, should probably get in the habit of using `sapply(df, is.numeric)` for things like that (in your definition of cols), judging by the comments by Matt and Kevin here: https://stackoverflow.com/a/18835813/ – Frank Sep 18 '18 at 15:40
@Tensibai I believe there is a `)` bracket missing in your comment, but I cannot really figure out where it should go. – Tom Sep 26 '18 at 13:24
1

@Tom just after is.numeric(x), I forgot to close the if condition: `if (is.numeric(x)) ` <- like this. Sorry – Tensibai Sep 26 '18 at 13:26
Thanks! I thought it should go there, but I still got an error. On second thought I think I forgot to convert to data.table before applying it. – Tom Sep 26 '18 at 13:30

Efficiently replacing negative values in large datasets conditional on type

1 Answers1

Linked