
I have a data frame in which some of the variables (columns) are factors, and for some records I have missing values (NA).

Questions are:

  1. What is the correct approach to replacing/imputing NAs in factor variables?

    e.g. VarX with 4 levels {"A", "B", "C", "D"}: what would be the preferred value to replace NAs with? A/B/C/D? Maybe just 0? Maybe impute with the level that is the majority among this variable's observations?

  2. How do I implement such an imputation, based on the answer to 1?

  3. Once 1 & 2 are resolved, I'll use the following to create dummy variables for the factor variables:

    library(dummies)  # dummy.data.frame() comes from the dummies package
    is.fact <- sapply(my_data, is.factor)
    my_data.dummy_vars <- dummy.data.frame(my_data[, is.fact], sep = ".")
    

    Afterwards, how do I replace all the factor variables in my_data with the dummy variables I've extracted into my_data.dummy_vars?

My use case is to calculate principal components afterwards (which requires all variables to be numeric, hence the dummy variables).

Thanks

Adiel
  • Can you provide a small example of what you have and what you expect? How are you going to treat the NAs? I don't know if this is a duplicate question; for example, see [this](http://stackoverflow.com/questions/8161836/how-do-i-replace-na-values-with-zeros-in-an-r-dataframe). – David Leal Feb 24 '17 at 17:08
  • Not sure how I should treat NAs in factor variables. Is replacing them with 0s before conversion to dummy vars a good idea? If so, I'd be happy to learn how to do it – Adiel Feb 24 '17 at 17:50
  • @DavidLeal See my post after the edit, I hope my intentions are more clear now. – Adiel Feb 24 '17 at 20:20

2 Answers


Thanks for clarifying your intentions - that really helps! Here are my thoughts:

  1. Imputing missing data is a non-trivial problem, and maybe a good question for the fine folks at Cross Validated. It's a problem that can only really be addressed in the context of the project, by you (the subject-matter expert). A big question is whether values are missing at random or as a function of some other variables, and whether those variables are observed or unobserved. If you conclude that they're missing as a function of other (observed) variables, you might even consider a model-based approach, perhaps using a GLM. The easiest approach by far (if you don't have many missing values) is to just delete those rows with something like mydata2 <- mydata[!is.na(mydata$TheFactorInQuestion), ]. I'll say it again: imputation of missing data is a non-trivial problem that should be considered carefully and in context. Perhaps a good approach is to try a few methods of imputation and see if (and how) your inferences change. If they don't change (much), you'll know you don't need to worry.

  2. Dropping rows instead could be done with a fairly simple mydata2 <- mydata[!is.na(mydata$TheFactorInQuestion), ]. If you do any other form of imputation (in a sense, "making up" data), I'd advocate thinking long and hard before concluding that it's the right decision. And, of course, it might be.

  3. Joining two data frames is pretty straightforward using cbind, something like my_data2 <- cbind(my_data, my_data.dummy_vars). If you then need to remove the column holding your factor data, use my_data3 <- my_data2[, -5] if, for example, the factor data is in column 5.
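Putting the three items together, here is a minimal sketch of the majority-level option from the question; my_data, VarX, and impute_mode are hypothetical names, and model.matrix from base R stands in for the dummies package:

```r
# Hypothetical example data: one numeric column and one factor column with NAs
my_data <- data.frame(x = 1:6,
                      VarX = factor(c("A", "B", NA, "A", NA, "C")))

# 1 & 2: majority-level ("mode") imputation for every factor column
impute_mode <- function(f) {
  mode_level <- names(which.max(table(f)))  # table() drops NAs by default
  f[is.na(f)] <- mode_level
  f
}
is.fact <- sapply(my_data, is.factor)
my_data[is.fact] <- lapply(my_data[is.fact], impute_mode)

# 3: dummy-code the factor with base R's model.matrix() (one 0/1 column per
# level, no intercept), then bind the dummies to the non-factor columns
dummies <- model.matrix(~ VarX - 1, data = my_data)
my_data2 <- cbind(my_data[, !is.fact, drop = FALSE], dummies)
```

Whether mode imputation is defensible is exactly the judgment call discussed in item 1; the code only shows the mechanics.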

Matt Tyers
  • Thanks! I'm afraid dropping rows is not an option for me (project constraint). I'll consult the Cross Validated folks for the correct approach - replacing with 0s or with the majority value. Could you help with how to code those? (replacing NAs with 0 / with the level that is the majority level for each factor) – Adiel Feb 24 '17 at 22:58
  • Never mind, I was introduced to the mice() package in R, which seems to do the thinking for me for each missing-value column (in short...). Thank you – Adiel Feb 24 '17 at 23:51

By dummy variables, do you mean zeroes and ones? This is how I'd structure it:

# first building a fake data frame
x <- 1:10
y <- as.factor(c("A","A","B","B","C","C",NA,"A","B","C"))
df <- data.frame(x,y)

# creating dummy variables 
df$dummy_A <- 1*(y=="A")
df$dummy_B <- 1*(y=="B")
df$dummy_C <- 1*(y=="C")

# did it work?
df
    x    y dummy_A dummy_B dummy_C
1   1    A       1       0       0
2   2    A       1       0       0
3   3    B       0       1       0
4   4    B       0       1       0
5   5    C       0       0       1
6   6    C       0       0       1
7   7 <NA>      NA      NA      NA
8   8    A       1       0       0
9   9    B       0       1       0
10 10    C       0       0       1
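If the factor has many levels, the same 0/1 columns can be built for all levels at once instead of one assignment per level; a base-R sketch on the same fake data (note that, as in the output above, the row where y is NA gets NA dummies):

```r
# same fake data frame as above
x <- 1:10
y <- as.factor(c("A","A","B","B","C","C",NA,"A","B","C"))
df <- data.frame(x, y)

# one 0/1 column per level, built by comparing y against each level in turn
dummies <- sapply(levels(y), function(lvl) 1 * (y == lvl))
colnames(dummies) <- paste0("dummy_", levels(y))
df <- cbind(df, dummies)
```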
Matt Tyers
  • I mean, for example, that if a factor variable has 4 levels, it'll be replaced by 5 dummy variables – Adiel Feb 24 '17 at 17:51
  • why would it need to be replaced by 5? what's the fifth case that would need a variable? – Matt Tyers Feb 24 '17 at 17:53
  • To my understanding, a variable with n levels is represented using n+1 dummy variables – Adiel Feb 24 '17 at 17:54
  • I don't understand what you mean. Really, only n-1 variables would be needed in order to contain the same amount of information. And depending on the analysis you're doing, coding a variable for each possible factor level (as suggested above) might result in an overparameterized model – Matt Tyers Feb 24 '17 at 17:59
  • I'll try to find the reference for my comment. This actually doesn't really matter, because I'm using the dummies library for that. My question is how to pre-process (if at all) factor variables that contain NA for some of their records, and how to replace the actual factor variables in my data with those dummy variables after I create them using the dummies library – Adiel Feb 24 '17 at 18:13
  • See my post after the edit, I hope my intentions are more clear now. – Adiel Feb 24 '17 at 20:20