In my dataset:
# A tibble: 240 x 1,415
matchcode S001 S002 S002EVS S003 S003A S004 S006 S007 S007_01 S008 S009 S009A S010 S010_01 S010_02 S010_03 S010_04 S011 S012 S013 S013B S014 S015 S016 S017 S017A
<fct> <dbl> <dbl> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl> <fct> <fct> <dbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl>
1 "JPN 198~ 2 1 -4 392 392 -4 324 324 3920120324 -4 JP JP -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 0.6789805 0.6789805
2 "MEX 198~ 2 1 -4 484 484 -4 933 2130 4840120926 -4 MX MX -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1.1378840 1.1378840
3 "HUN 198~ 2 1 -4 348 348 -4 1280 4321 3480121280 -4 HU HU -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1.0635516 1.0635516
4 "AUS 198~ 2 1 -4 36 36 -4 973 5478 360120973 -4 AU AU -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 0.9616138 0.9616138
5 "ARG 198~ 2 1 -4 32 32 -4 874 6607 320120874 -4 AR AR -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 0.9266260 0.9266260
6 "FIN 198~ 2 1 -4 246 246 -4 385 7123 2460120385 -4 FI FI -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1.0000000 1.0000000
7 "KOR 198~ 2 1 -4 410 410 -4 3 7744 4100120003 -4 KR KR -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1.0000000 1.0000000
8 "ZAF 198~ 2 1 -4 710 710 -4 5420 10260 7100121549 -4 ZA ZA -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1.0000000 1.0000000
9 "ARG 199~ 2 2 -4 32 32 -4 856 11163 320240856 -4 AR AR 125 -4 -4 -4 -4 1210 -4 1 -4 -4 -4 -4 1.0000000 1.0000000
10 "BLR 199~ 2 2 -4 112 112 -4 106 11415 1120240106 -4 BY BY -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1.0000000 1.0000000
To replace all negative values with NA, I used the following code:
df[df < 0] <- NA
However, I only want this operation carried out on columns that are not character columns (I want to get rid of the error messages without suppressing them). The variable charcol holds the names of the columns that should be skipped. I tried:
df[-charcol][df[-charcol] < 0] <- NA
This gave me the error:
Error: cannot allocate vector of size 1.8 Gb
It also still gave me the warnings:
In addition: Warning messages:
1: In Ops.factor(left, right) : ‘<’ not meaningful for factors
Although I probably got the syntax wrong, I am wondering what the most efficient solution to this kind of problem would be for large datasets. I have been looking at the data.table vignette for a while, but I cannot figure out the syntax.
Any suggestions?
str(WVSsample)
Classes ‘data.table’ and 'data.frame': 240 obs. of 1415 variables:
$ matchcode : Factor w/ 240 levels "ALB 1998 ","ALB 2002 ",..: 108 134 88 12 4 73 117 232 5 25 ...
$ S001 :Class 'labelled' atomic [1:240] 2 2 2 2 2 2 2 2 2 2 ...
.. ..- attr(*, "label")= chr "Study"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:7] -5 -4 -3 -2 -1 1 2
.. .. ..- attr(*, "names")= chr [1:7] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
$ S002 :Class 'labelled' atomic [1:240] 1 1 1 1 1 1 1 1 2 2 ...
.. ..- attr(*, "label")= chr "Wave"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:11] -5 -4 -3 -2 -1 1 2 3 4 5 ...
.. .. ..- attr(*, "names")= chr [1:11] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
$ S002EVS :Class 'labelled' atomic [1:240] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
.. ..- attr(*, "label")= chr "EVS-wave"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:9] -5 -4 -3 -2 -1 1 2 3 4
.. .. ..- attr(*, "names")= chr [1:9] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
$ S003 :Class 'labelled' atomic [1:240] 392 484 348 36 32 246 410 710 32 112 ...
.. ..- attr(*, "label")= chr "Country/region"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:199] -5 -4 -3 -2 -1 4 8 12 16 20 ...
.. .. ..- attr(*, "names")= chr [1:199] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
$ S003A :Class 'labelled' atomic [1:240] 392 484 348 36 32 246 410 710 32 112 ...
.. ..- attr(*, "label")= chr "Country/regions [with split ups]"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:199] -5 -4 -3 -2 -1 4 8 12 16 20 ...
.. .. ..- attr(*, "names")= chr [1:199] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
$ S004 :Class 'labelled' atomic [1:240] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
.. ..- attr(*, "label")= chr "Set"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:7] -5 -4 -3 -2 -1 1 2
.. .. ..- attr(*, "names")= chr [1:7] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
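For reference, here is a minimal base R sketch of what I am trying to achieve. It is only a sketch under assumptions: the toy data frame below is made up to mimic a few of my columns, and it detects numeric columns with is.numeric instead of using my charcol variable. I have not checked how it behaves with the labelled columns or at the full size of my data.

```r
# Toy stand-in for the real data (hypothetical values, not the actual WVS sample)
df <- data.frame(
  matchcode = factor(c("JPN 1981", "MEX 1981", "ARG 1991")),
  S001      = c(2, 2, 2),
  S002EVS   = c(-4, -4, 1),
  S009      = c("JP", "MX", "AR"),
  stringsAsFactors = FALSE
)

# Pick out the numeric columns; factors and characters return FALSE here
numcols <- names(df)[vapply(df, is.numeric, logical(1))]

# Replace negative values with NA in those columns only,
# so `<` is never evaluated on a factor or character column
df[numcols] <- lapply(df[numcols], function(x) replace(x, which(x < 0), NA))
```

Because the comparison is only run on numeric columns, the Ops.factor warning should not be triggered in the first place.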
EDIT: @chinsoon12 mentioned using the following piece of code:
f_dowle3 = function(DT) {
  for (j in seq_len(ncol(DT)))
    set(DT, which(is.na(DT[[j]])), j, 0)
}
This code, however, falls short in two ways:
1. It replaces NAs with zero, while I want to replace negative values with NA. I need to change the which(is.na(DT[[j]])) part to something like which(DT[[j]] < 0).
2. It does not account for character columns.
I changed the code to:
f_dowle3 = function(DT) {
  # or by number (slightly faster than by name) :
  for (j in seq_len(ncol(DT)))
    set(DT, which(DT[[j]] < 0), j, NA)
}
But this makes the dataset NULL. Could anyone help me adapt the code properly?
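One possible direction, as a hedged sketch rather than a confirmed fix: a for loop in R evaluates to NULL, so a function whose last expression is a for loop returns NULL, and assigning that result back to the dataset would overwrite it. Since set() modifies the table by reference, the function can simply be called without assignment. The function name negative_to_na and the toy table are hypothetical, and the is.numeric check is my assumption for how to skip character and factor columns.

```r
library(data.table)

# Replace negative values with NA by reference, numeric columns only.
# negative_to_na is a hypothetical helper name, not part of data.table.
negative_to_na <- function(DT) {
  # Positions of numeric columns; factor and character columns are skipped
  numcols <- which(vapply(DT, is.numeric, logical(1)))
  for (j in numcols) {
    set(DT, i = which(DT[[j]] < 0), j = j, value = NA)
  }
  # Return the table invisibly so an accidental assignment does not yield NULL
  invisible(DT)
}

# Toy stand-in for the real data (hypothetical values)
DT <- data.table(
  matchcode = factor(c("JPN 1981", "MEX 1981", "ARG 1991")),
  S001      = c(2, 2, 2),
  S002EVS   = c(-4, -4, 1),
  S009      = c("JP", "MX", "AR")
)

# Call for its side effect; DT is updated in place
negative_to_na(DT)
```

Looping column by column with set() avoids building one large logical matrix for the whole table, which is presumably what led to the memory allocation error in the df[df < 0] style of assignment.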