1

I have used the following code to run a regression:

res <- lm (c241 ~ x + matchcode , data = df)
summary(res)

Where matchcode is a variable which is a combination of an Iso3c-code and a year. For Russia 2005 it for example is RUS 2005 (see the first variable of the tibble). The idea is to use this matchcode as a fixed effect, like the lmabove. Applying thelm above works perfectly

As I have huge datasets (altogether over 4000 variables):

# A tibble: 450 x 546
   matchcode idstd year  country wt    region income industry sector ownership exporter c201  c202  c203a c203b c203c c203d c2041 c2042 c205a  c205b1 c205b2 c205b3 c205b4 c205b5 c205b6 c205b7 c205b8 c205b9 c205b10 c205c c205d c206a c206b c2071
   <chr+lbl> <dbl> <dbl> <chr+l> <dbl> <dbl+> <dbl+> <dbl+lb> <dbl+> <dbl+lbl> <dbl+lb> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 "BGD 200~  2128 2002  Bangla~ NA    6      1       8       1      2         2        1988  4     100     0   0     NA     2    NA        NA NA     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1     1    NA    2    
 2 "BGD 200~  2926 2002  Bangla~ NA    6      1       1       1      2         2        2000  1     100     0   0     NA     2    NA        NA NA     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1    NA    NA    1    
 3 "BGD 200~  2931 2002  Bangla~ NA    6      1       1       1      2         1        1993  4     100     0   0     NA     2    NA        NA NA     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1    NA    NA    2    
 4 "BRA 200~ 15303 2003  Brazil~ NA    4      2       9       1      2         2        1946  2     100     0   0      0     2    NA     18.72  1     NA     NA     NA     NA     NA     NA     NA     NA     NA       2     2     1     2    5    
 5 "BRA 200~ 14917 2003  Brazil~ NA    4      2       8       1      2         2        1984  2     100     0   0      0     2    NA     50.00  1     NA     NA     NA     NA     NA     NA     NA     NA      1       1     1     1     2    3    
 6 "BRA 200~ 14212 2003  Brazil~ NA    4      2      11       1      2         2        1998  2     100     0   0      0     2    NA     50.00  1     NA     NA     NA     NA     NA     NA     NA     NA      1       1     1     1     2    2    
 7 "KHM 200~ 16067 2003  Cambod~ NA    2      1      23       2      1         1        1993  4      50    50   0      0     2    NA    100.00  1     NA      1      1     NA     NA     NA     NA     NA     NA       1     1     1     1    1    
 8 "KHM 200~ 16233 2003  Cambod~ NA    2      1      10       4      2         2        1989  4     100     0   0      0     2    NA    100.00  1     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1     1     2    3    
 9 "KHM 200~ 16002 2003  Cambod~ NA    2      1       3       1      1         1        1990  5       0   100   0      0     2    NA     50.00  1     NA     NA     NA     NA     NA     NA     NA     NA     NA       1     1     1     1    1    
10 "CHN 200~ 17987 2002  China2~ NA    2      2       8       1      1         2        1993  6      55    45   0     NA    NA    NA        NA NA     NA     NA     NA     NA     NA     NA     NA     NA

I wanted to loop through the variables as follows LINK;

dfoutput<- data.table(df)[, .(nm = names(.SD),dffits= lapply(.SD, function(x) if(is.numeric(x)) summary(lm(y~ x, na.action=na.omit)))), .SDcols = -1]

This however creates a data.table NULL in data.table for the matchcode variable (as well as any other character variables).

Example of NULL

When I try to add matchcode to the regression:

dfoutput<- data.table(df)[, .(nm = names(.SD),dffits= lapply(.SD, function(x) if(is.numeric(x)) summary(lm(c241~ x + matchcode, na.action=na.omit)))), .SDcols = -1]

or I use the lapply with matchcode:

df <- lapply( df[,-1], function(x) summary(lm(df$c241~ x + df$matchcode)) )

it gives the following "famous" error :

 Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels 

Although I have read that this error can mean anything, my factor has indeed only one level, but this appeared to work fine in the single regression (also adding other variables, which are not characters, is fine). When using the the data.table loop or lapply it does not. My question is two-fold:

1) Why does the variable variable matchcode work in the first situation (res <- lm (c241 ~ x + matchcode , data = df) and not in the second dfoutput<- data.table(df)[, .(nm = names(.SD),dffits= lapply(.SD, function(x) if(is.numeric(x)) summary(lm(c241~ x + matchcode, na.action=na.omit)))), .SDcols = -1]?

2) What can I do to circumvent this? As the variable is paramount for the model.

Maybe I can convert the character variable, or I can recode it in some way?

UPDATE: I have used the code in this link: LINK to convert the characters into a factor with one level, which in the end resulted in the same error.

 ES1sample <- dput(head(ES1sample[, ],10))
structure(list(matchcode = structure(c("BGD 2002 ", "BRA 2003 ", 
"KHM 2003 ", "CHN 2002 ", "CHN 2003 ", "ECU 2003 ", "ERI 2002 ", 
"ETH 2002 ", "GTM 2003 ", "HND 2003 "), label = "", class = c("labelled", 
"character")), idstd = structure(c(2760, 14273, 16104, 17039, 
19095, 22207, 23063, 24046, 25420, 26212), label = "WEB STD FIRMID", format.stata = "%5.0f", class = c("labelled", 
"numeric")), year = structure(c(2002, 2003, 2003, 2002, 2003, 
2003, 2002, 2002, 2003, 2003), format.stata = "%9.0g", label = "", class = c("labelled", 
"numeric")), country = structure(c("Bangladesh2002", "Brazil2003", 
"Cambodia2003", "China2002", "China2003", "Ecuador2003", "Eritrea2002", 
"Ethiopia2002", "Guatemala2003", "Honduras2003"), label = "Country", format.stata = "%21s", class = c("labelled", 
"character")), wt = structure(c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), label = "locations and sectors weights", format.stata = "%9.0g", class = c("labelled", 
"numeric")), region = structure(c(6, 4, 2, 2, 2, 4, 1, 1, 4, 
4), label = "", class = c("labelled", "numeric")), income = structure(c(1, 
2, 1, 2, 2, 2, 1, 1, 2, 2), label = "income grouping for survey year", class = c("labelled", 
"numeric")), industry = structure(c(1, 1, 20, 3, 20, 12, 7, 3, 
NA, 3), label = "Industry", class = c("labelled", "numeric")), 
    sector = structure(c(1, 1, 2, 1, 2, 1, 1, 1, 2, 1), label = "Sector", class = c("labelled", 
    "numeric")), ownership = structure(c(2, 2, 2, 2, 2, 2, 2, 
    2, 1, 1), label = "Ownership", class = c("labelled", "numeric"
    )), exporter = structure(c(2, 2, 2, 2, 2, 1, 2, 2, 2, 1), label = "Export", class = c("labelled", 
    "numeric")), c201 = structure(c(1991, 1993, 1999, 1979, 1998, 
    1997, 1996, 1998, 1990, 1998), label = "Year firm began operations in this country", format.stata = "%4.0f", class = c("labelled", 
    "numeric")), c202 = structure(c(2, 2, 4, 6, 6, 2, 6, NA, 
    NA, NA), label = "Current legal status of firm", class = c("labelled", 
    "numeric")), c203a = structure(c(100, 100, 100, 100, 0, 100, 
    0, 100, 0, 0), label = "Percentage of firm owned by domestic private sector", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c203b = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 
    100, 100), label = "Percentage of firm owned by foreign private sector", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c203c = structure(c(0, 0, 0, 0, 100, 0, 0, 0, 
    0, 0), label = "Percentage of firm owned by government/state", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c203d = structure(c(NA, 0, 0, NA, NA, 0, 100, 
    0, 0, 0), label = "Percentage of firm owned by other types of owner", format.stata = "%8.2f", class = c("labelled", 
    "numeric")), c2041 = structure(c(2, 2, 2, NA, NA, 2, 2, 2, 
    2, 2), label = "Firm previously owned by government?", class = c("labelled", 
    "numeric")), c2042 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Year of privatization", format.stata = "%4.0f", class = c("labelled", 
    "numeric")), c205a = structure(c(NA, 25, 100, NA, 100, 100, 
    NA, NA, 100, 100), label = "Percentage owned by largest shareholder", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c205b1 = structure(c(NA, NA, NA, NA, NA, NA, 
    NA, NA, 1, 1), label = "Largest shareholder: individual", class = c("labelled", 
    "numeric")), c205b2 = structure(c(NA, 1, 1, NA, NA, 1, NA, 
    NA, NA, NA), label = "Largest shareholder: family", class = c("labelled", 
    "numeric")), c205b3 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: domestic company", class = c("labelled", 
    "numeric")), c205b4 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: foreign company", class = c("labelled", 
    "numeric")), c205b5 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: bank", class = c("labelled", 
    "numeric")), c205b6 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: investment fund", class = c("labelled", 
    "numeric")), c205b7 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: firm managers", class = c("labelled", 
    "numeric")), c205b8 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: firm employees", class = c("labelled", 
    "numeric")), c205b9 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: government", class = c("labelled", 
    "numeric")), c205b10 = structure(c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), label = "Largest shareholder: other", class = c("labelled", 
    "numeric")), c205c = structure(c(1, 1, 1, NA, 2, 1, 1, 1, 
    1, 1), label = "Is principal shareholder also the manager/director?", class = c("labelled", 
    "numeric")), c205d = structure(c(1, 1, 1, NA, NA, 2, 1, 1, 
    1, 1), label = "Is the principal owner male?", class = c("labelled", 
    "numeric")), c206a = structure(c(1, 1, 1, NA, NA, 1, 0, 1, 
    2, 3), label = "Number of separate operating facilities in this country", format.stata = "%4.0f", class = c("labelled", 
    "numeric")), c206b = structure(c(NA, 2, 2, NA, NA, 2, 2, 
    2, 2, 1), label = "Operations in other countries?", class = c("labelled", 
    "numeric")), c2071 = structure(c(1, 5, 1, 1, 2, 1, 5, 1, 
    1, 3), label = "Location of establishment", class = c("labelled", 
    "numeric")), c2072 = structure(c(NA, NA, NA, NA, NA, 1, 5, 
    NA, 1, 3), label = "Location of headquarters", class = c("labelled", 
    "numeric")), c208 = structure(c(NA, 1723, NA, NA, NA, NA, 
    NA, NA, NA, 1810), label = "Main product line", format.stata = "%10.0g", class = c("labelled", 
    "numeric")), c209a = structure(c(NA, 1, 2, NA, NA, NA, NA, 
    NA, NA, NA), label = "Other income generating activities", class = c("labelled", 
    "numeric")), c209ba = structure(c(NA, 0, NA, NA, NA, 100, 
    NA, NA, NA, NA), label = "Percent workers' time: manufacturing", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c209bb = structure(c(NA, 0, NA, NA, NA, 0, NA, 
    NA, NA, NA), label = "Percent workers' time: services", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c209bc = structure(c(NA, 0, NA, NA, NA, NA, 
    NA, NA, NA, NA), label = "Percent workers' time: commerce (retail/wholesale trade)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c209bd = structure(c(NA, 0, NA, NA, NA, NA, 
    NA, NA, NA, NA), label = "Percent workers' time: construction", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c209be = structure(c(NA, 1, NA, NA, NA, 0, NA, 
    NA, NA, NA), label = "Percent workers' time: other", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c210a = structure(c(NA, 50, 1, NA, NA, NA, NA, 
    NA, 65, 0), label = "Main product line: firm's share of local market", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c210b = structure(c(12, 25, 1, 6, 44.6399993896484, 
    50, NA, 5, 45, 0), label = "Main product line: firm's share of national market", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c211a1 = structure(c(100, 100, 100, 99, 100, 
    90, 100, 100, 95, 0), label = "Percent of sales sold domestically", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c211a2 = structure(c(0, 0, 0, 0, 0, 10, 0, 0, 
    5, 100), label = "Percent of sales exported directly", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c211a3 = structure(c(0, 0, 0, 1, 0, 0, NA, 0, 
    0, 0), label = "Percent of sales exported indirectly", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c211b1 = structure(c(NA, 0, 0, 0, 0, 0, NA, 
    NA, NA, NA), label = "Percentage of domestic sales to government", format.stata = "%9.2f", class = c("labelled", 
    "numeric")), c282a2y = structure(c(451645, NA, NA, 49609, 
    1061449, 21, 38966.6171875, 43.0904998779297, NA, NA), label = "Total liabilities 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282b2y = structure(c(NA, NA, NA, 393, 81012, 
    1, 0, NA, NA, NA), label = "Long-term liabilities (>1 year) 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282c2y = structure(c(NA, NA, NA, 55193, 980437, 
    20, 7687.90576171875, NA, NA, NA), label = "Short-term liabilities (<1 year) 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282d2y = structure(c(NA, NA, 5500, 55193, 19194, 
    NA, 0, NA, NA, NA), label = "Payable short-term liabilities 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282e2y = structure(c(10000, NA, NA, 5000, 20329, 
    12, 29826.701171875, 40, NA, NA), label = "Equity–share capital 2 years ago (thousands LCU)", format.stata = "%10.0g", class = c("labelled", 
    "numeric")), c282f2y = structure(c(305436, NA, 5500, 916, 
    NA, NA, 1452.00903320312, 6.09049987792969, NA, NA), label = "Retained earnings 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282a3y = structure(c(456618, NA, NA, NA, 1063217, 
    16, 39676.3984375, 43.7501029968262, NA, NA), label = "Total liabilities 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282b3y = structure(c(NA, NA, NA, NA, 152156, 
    5, 0, NA, NA, NA), label = "Long-term liabilities (>1 year) 3 years ago (thousands LCU)", format.stata = "%10.0g", class = c("labelled", 
    "numeric")), c282c3y = structure(c(NA, NA, NA, NA, 911061, 
    11, 6299.255859375, NA, NA, NA), label = "Short-term liabilities (<1 year) 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282d3y = structure(c(NA, NA, NA, NA, 20964, 
    NA, 0, NA, NA, NA), label = "Payable short-term liabilities 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), c282e3y = structure(c(10000, NA, NA, NA, 124531, 
    9, 32368.564453125, 40, NA, NA), label = "Equity–share capital 3 years ago (thousands LCU)", format.stata = "%10.0g", class = c("labelled", 
    "numeric")), c282f3y = structure(c(282840, NA, NA, NA, NA, 
    NA, 1008.58001708984, 6.75010013580322, NA, NA), label = "Retained earnings 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), gni = structure(c(370, 2760, 300, 970, 1100, 
    1830, 150, 100, 1910, 960), label = "Gross National Income per capita, Atlas Method (current ), World Development Ind", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), pop = structure(c(135683664, 176596256, 13403644, 
    1280400000, 1288400000, 13007942, 4296700, 67217840, 12307091, 
    6968512), label = "Population, Total, in 2005 (World Development Indicators)", format.stata = "%9.0g", class = c("labelled", 
    "numeric")), country_proper = structure(c("Bangladesh", "Brazil", 
    "Cambodia", "China", "China", "Ecuador", "Eritrea", "Ethiopia", 
    "Guatemala", "Honduras"), format.stata = "%22s", label = "", class = c("labelled", 
    "character")), c_abbr = structure(c("BGD", "BRA", "KHM", 
    "CHN", "CHN", "ECU", "ERI", "ETH", "GTM", "HND"), format.stata = "%9s", label = "", class = c("labelled", 
    "character")), countryyear = structure(c("Bangladesh2002", 
    "Brazil2003", "Cambodia2003", "China2002", "China2003", "Ecuador2003", 
    "Eritrea2002", "Ethiopia2002", "Guatemala2003", "Honduras2003"
    ), label = "Country", format.stata = "%21s", class = c("labelled", 
    "character")), iso3c = structure(c("BGD", "BRA", "KHM", "CHN", 
    "CHN", "ECU", "ERI", "ETH", "GTM", "HND"), label = "", class = c("labelled", 
    "character")), cname = structure(c("Bangladesh", "Brazil", 
    "Cambodia", "China", "China", "Ecuador", "Eritrea", "Ethiopia", 
    "Guatemala", "Honduras"), label = "", class = c("labelled", 
    "character")), cyear = structure(c("2002", "2003", "2003", 
    "2002", "2003", "2003", "2002", "2002", "2003", "2003"), label = "", class = c("labelled", 
    "character"))), .Names = c("matchcode", "idstd", "year", 
"country", "wt", "region", "income", "industry", "sector", "ownership", 
"exporter", "c201", "c202", "c203a", "c203b", "c203c", "c203d", 
"c2041", "c2042", "c205a", "c205b1", "c205b2", "c205b3", "c205b4", 
"c205b5", "c205b6", "c205b7", "c205b8", "c205b9", "c205b10", 
"c205c", "c205d", "c206a", "c206b", "c2071", "c2072", "c208", 
"c209a", "c209ba", "c209bb", "c209bc", "c209bd", "c209be", "c210a", 
"c210b", "c211a1", "c211a2", "c211a3", "c211b1", "c282a2y", "c282b2y", 
"c282c2y", "c282d2y", "c282e2y", "c282f2y", "c282a3y", "c282b3y", 
"c282c3y", "c282d3y", "c282e3y", "c282f3y", "gni", "pop", "country_proper", 
"c_abbr", "countryyear", "iso3c", "cname", "cyear"), class = c("data.table", 
"data.frame"), row.names = c(NA, -10L), .internal.selfref = <pointer: 0x0000000002570788>)
Tom
  • 2,173
  • 1
  • 17
  • 44
  • in the `lm (y ~ x + matchcode , data = df)` you should have `y` and `x` but i cannot find them in your dataset – Sal-laS Sep 05 '18 at 17:44
  • @Salman I used `y` to simplify the example (c241 would be the actual y). The `x` is just the function `x`. It applies every variable in a loop. I have regretfully not been able to test your answer yet. I will do so first thing in the morning. – Tom Sep 05 '18 at 19:38
  • I tried it with a single regression and I got the desired output (with all different matchcodes). I understand that lapply fails, and also that that is why data.table rows 1 and 4 produce NULL. I just want to know how to fix that. – Tom Sep 05 '18 at 22:03

1 Answers1

1

As i understand, you have tried to do a feature selection, but you have choosen a complicated method, and it made your output NULL. I agree, that you have so many feature and you need to make a feature selection before regression.

There are very prominent feature selection methods such as Random Forest. Random Forest helps you to detect the best predictors.

Given that i am interested to predict the Species of plants, but i dont know which feature can predict it the best (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width ). So the below code specify the best predicotrs:

library(party)
colnames(iris)
cf1 <- cforest(Species ~ . , data= iris, control=cforest_unbiased(mtry=2,ntree=50))

varimp(cf1)

The result of varimp() gives you:

A vector of ‘mean decrease in accuracy’ importance scores. In other word, the higher the score, probably it is better predictor.

In the example:

Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
 0.047636364  0.002909091  0.354181818  0.227636364
Sal-laS
  • 11,016
  • 25
  • 99
  • 169
  • 1
    Thank you for your answer! It works! It is not ideal as it forces me to subset my data first (missing values in response variable not allowed), which in my case creates workspace memory issues. Would it be possible to tell the function to select only non NA values within the function, so I do not have to subset? In addition, could you elaborate a little bit on the `control=cforest_unbiased(mtry=2,ntree=50)` ? – Tom Sep 06 '18 at 08:13
  • @TomKisters i am happy that, it helps you. Regarding `NA`s, you should clean your data before starting to do anything. And removing or replacing the NAs is somehow part of this step as well. – Sal-laS Sep 06 '18 at 08:16
  • In an ideal world, yes. But I have multiple compliance variables within each database. Besides the lm function and many others deal with NA's without any problem. In my project removing NA's is not very desirable. – Tom Sep 06 '18 at 08:32
  • @TomKisters there should be methods for *replacing* the `NA`s as well. – Sal-laS Sep 06 '18 at 08:34