2

I have a very extensive dataframe with the following built up:

character    factor  labelled   numeric 
        6         1       945         2 

Where the labelled are from the haven package (Stata import) and function as factors. Please see some example data below:

matchcode S001  S002  S003  S003A S004  S006  S007  S007_01    S009  S009A S010  S011  S012  S013  S013B S016  S017      S017A     S018      S018A     S019      S019A     S020  S021       
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl>  <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl+lbl>  
 1 "JPN 198~ 2     1     392   392   NA     494   494  3920120494 JP    JP    NA    NA    NA    NA    NA    NA    1.0897217 1.0897217 0.9050845 0.9050845 1.3576267 1.3576267 1981  39201211981
 2 "JPN 198~ 2     1     392   392   NA     115   115  3920120115 JP    JP    NA    NA    NA    NA    NA    NA    0.6789805 0.6789805 0.5639373 0.5639373 0.8459059 0.8459059 1981  39201211981
 3 "JPN 198~ 2     1     392   392   NA     949   949  3920120949 JP    JP    NA    NA    NA    NA    NA    NA    1.0897217 1.0897217 0.9050845 0.9050845 1.3576267 1.3576267 1981  39201211981
 4 "MEX 198~ 2     1     484   484   NA     112  1315  4840120111 MX    MX    NA    NA    NA    NA    NA    NA    0.7965188 0.7965188 0.4335976 0.4335976 0.6503964 0.6503964 1981  48401211981
 5 "MEX 198~ 2     1     484   484   NA    1042  2238  4840121034 MX    MX    NA    NA    NA    NA    NA    NA    1.1378840 1.1378840 0.6194252 0.6194252 0.9291378 0.9291378 1981  48401211981
 6 "MEX 198~ 2     1     484   484   NA    1315  2510  4840121306 MX    MX    NA    NA    NA    NA    NA    NA    1.1378840 1.1378840 0.6194252 0.6194252 0.9291378 0.9291378 1981  48401211981
 7 "HUN 198~ 2     1     348   348   NA     250  3291  3480120250 HU    HU    NA    NA    NA    NA    NA    NA    1.0635516 1.0635516 0.7264696 0.7264696 1.0897045 1.0897045 1982  34801211982
 8 "HUN 198~ 2     1     348   348   NA     943  3984  3480120943 HU    HU    NA    NA    NA    NA    NA    NA    1.0635516 1.0635516 0.7264696 0.7264696 1.0897045 1.0897045 1982  34801211982
 9 "HUN 198~ 2     1     348   348   NA     726  3767  3480120726 HU    HU    NA    NA    NA    NA    NA    NA    1.0635516 1.0635516 0.7264696 0.7264696 1.0897045 1.0897045 1982  34801211982
10 "AUS 198~ 2     1      36    36   NA     342  4847   360120342 AU    AU    NA    NA    NA    NA    NA    NA    0.9616138 0.9616138 0.7830731 0.7830731 1.1746096 1.1746096 1981   3601211981

I converted the negative numbers in the dataset to NA (which they are);

df[df < 0] <- NA
df<- df[,colMeans(is.na(df)) <= 0.999]

I wanted to convert all factors to numeric (in order to be able to take the mean of each value later) by using:

as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
df[] = lapply(df, as.numeric.factor)

This initially worked. However, after replacing all negative numbers with NA's it no longer did and everything became NA. It seems that the function could have trouble dealing with NA's? If so, how do I deal with it?

The idea is to ultimately summarize (take the mean) of each variable per country-year:

cols = sapply(WVS, is.numeric)
cols = names(cols)[cols]
dfclevel= df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]

In the end I tried to switch it around to circumvent the NA's;

df <- as.data.frame(df)
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
df[] = lapply(df, as.numeric.factor) 
cols = sapply(df, is.numeric)
cols = names(cols)[cols]
df[df < 0] <- NA
df <- df[,colMeans(is.na(df)) <= 0.999]
df <- data.table(df)
dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]

but then I get:

> dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]
Error in `[.data.frame`(df, , lapply(.SD, mean, na.rm = TRUE),  : 
  unused arguments (.SDcols = cols, by = matchcode)
> df <- data.table(df)
> dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]
Error in `[.data.table`(df, , lapply(.SD, mean, na.rm = TRUE),  : 
  Some items of .SDcols are not column names (or are NA)

I tried without .SDcols=cols, then I get:

> df <- as.data.frame(df)
> as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
> df[] = lapply(df, as.numeric.factor) 
Error in `[<-.data.frame`(`*tmp*`, , value = list(matchcode = c(NA_real_,  : 
  replacement element 6 has 717 rows, need 720
In addition: Warning message:
In FUN(X[[i]], ...) : NAs introduced by coercion
> df <- data.table(df)
> dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), by=matchcode]
Error in gmean(S009, na.rm = TRUE) : 
  Type 'character' not supported by GForce mean (gmean) na.rm=TRUE. Either add the prefix base::mean(.) or turn off GForce optimization using options(datatable.optimize=1)

I have been struggling with this issue for so long, some help would truly be appreciated.

Tom
  • 2,173
  • 1
  • 17
  • 44
  • I think `as.numeric(yourfactor)` will return the numeric level of each element. You could try `as.numeric(as.character(yourfactor))` – rrr Sep 01 '18 at 15:57
  • 1
    Where did I use `as.numeric(yourfactor)`? According to the following link, it it should be equivalent and faster to use `as.numeric(levels(f))[f]`. https://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-integer-numeric-without-loss-of-information – Tom Sep 03 '18 at 06:36
  • @rrr It turns out you were right (although I think maybe for the wrong reason?) In any case, the `as.numeric(as.character(x))` appear to deal with NA cases which `as.numeric(levels(f))[f]` is unable to handle. – Tom Sep 03 '18 at 13:49

1 Answers1

0

Sticking to the last approach of the OP; the function used to convert the NA's needs to be replaced with an, albeit less efficient, function that can deal with NA's, which is;

as.numeric(as.character(x))

The code then becomes:

df <- as.data.frame(df)
as.numeric.factor <- function(x) {as.numeric(as.character(x))}
df[] = lapply(df, as.numeric.factor) 
df[df < 0] <- NA
df <- df[,colMeans(is.na(df)) <= 0.999]
df <- data.table(df)
cols = sapply(df, is.numeric)
cols = names(cols)[cols]
dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]
Tom
  • 2,173
  • 1
  • 17
  • 44