I have a very extensive dataframe with the following built up:
character factor labelled numeric
6 1 945 2
Where the labelled are from the haven
package (Stata import) and function as factors. Please see some example data below:
matchcode S001 S002 S003 S003A S004 S006 S007 S007_01 S009 S009A S010 S011 S012 S013 S013B S016 S017 S017A S018 S018A S019 S019A S020 S021
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl+lbl>
1 "JPN 198~ 2 1 392 392 NA 494 494 3920120494 JP JP NA NA NA NA NA NA 1.0897217 1.0897217 0.9050845 0.9050845 1.3576267 1.3576267 1981 39201211981
2 "JPN 198~ 2 1 392 392 NA 115 115 3920120115 JP JP NA NA NA NA NA NA 0.6789805 0.6789805 0.5639373 0.5639373 0.8459059 0.8459059 1981 39201211981
3 "JPN 198~ 2 1 392 392 NA 949 949 3920120949 JP JP NA NA NA NA NA NA 1.0897217 1.0897217 0.9050845 0.9050845 1.3576267 1.3576267 1981 39201211981
4 "MEX 198~ 2 1 484 484 NA 112 1315 4840120111 MX MX NA NA NA NA NA NA 0.7965188 0.7965188 0.4335976 0.4335976 0.6503964 0.6503964 1981 48401211981
5 "MEX 198~ 2 1 484 484 NA 1042 2238 4840121034 MX MX NA NA NA NA NA NA 1.1378840 1.1378840 0.6194252 0.6194252 0.9291378 0.9291378 1981 48401211981
6 "MEX 198~ 2 1 484 484 NA 1315 2510 4840121306 MX MX NA NA NA NA NA NA 1.1378840 1.1378840 0.6194252 0.6194252 0.9291378 0.9291378 1981 48401211981
7 "HUN 198~ 2 1 348 348 NA 250 3291 3480120250 HU HU NA NA NA NA NA NA 1.0635516 1.0635516 0.7264696 0.7264696 1.0897045 1.0897045 1982 34801211982
8 "HUN 198~ 2 1 348 348 NA 943 3984 3480120943 HU HU NA NA NA NA NA NA 1.0635516 1.0635516 0.7264696 0.7264696 1.0897045 1.0897045 1982 34801211982
9 "HUN 198~ 2 1 348 348 NA 726 3767 3480120726 HU HU NA NA NA NA NA NA 1.0635516 1.0635516 0.7264696 0.7264696 1.0897045 1.0897045 1982 34801211982
10 "AUS 198~ 2 1 36 36 NA 342 4847 360120342 AU AU NA NA NA NA NA NA 0.9616138 0.9616138 0.7830731 0.7830731 1.1746096 1.1746096 1981 3601211981
I converted the negative numbers in the dataset to NA (which they are);
df[df < 0] <- NA
df<- df[,colMeans(is.na(df)) <= 0.999]
I wanted to convert all factors to numeric (in order to be able to take the mean of each value later) by using:
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
df[] = lapply(df, as.numeric.factor)
This initially worked. However, after replacing all negative numbers with NA's it no longer did and everything became NA. It seems that the function could have trouble dealing with NA's? If so, how do I deal with it?
The idea is to ultimately summarize (take the mean) of each variable per country-year:
cols = sapply(WVS, is.numeric)
cols = names(cols)[cols]
dfclevel= df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]
In the end I tried to switch it around to circumvent the NA's;
df <- as.data.frame(df)
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
df[] = lapply(df, as.numeric.factor)
cols = sapply(df, is.numeric)
cols = names(cols)[cols]
df[df < 0] <- NA
df <- df[,colMeans(is.na(df)) <= 0.999]
df <- data.table(df)
dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]
but then I get:
> dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]
Error in `[.data.frame`(df, , lapply(.SD, mean, na.rm = TRUE), :
unused arguments (.SDcols = cols, by = matchcode)
> df <- data.table(df)
> dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]
Error in `[.data.table`(df, , lapply(.SD, mean, na.rm = TRUE), :
Some items of .SDcols are not column names (or are NA)
I tried without .SDcols=cols
, then I get:
> df <- as.data.frame(df)
> as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
> df[] = lapply(df, as.numeric.factor)
Error in `[<-.data.frame`(`*tmp*`, , value = list(matchcode = c(NA_real_, :
replacement element 6 has 717 rows, need 720
In addition: Warning message:
In FUN(X[[i]], ...) : NAs introduced by coercion
> df <- data.table(df)
> dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), by=matchcode]
Error in gmean(S009, na.rm = TRUE) :
Type 'character' not supported by GForce mean (gmean) na.rm=TRUE. Either add the prefix base::mean(.) or turn off GForce optimization using options(datatable.optimize=1)
I have been struggling with this issue for so long, some help would truly be appreciated.