-1

I have a dataset with 77 columns and some of them have categorical values (for ex. the column 'Sexual Orientation' can have one of the following values: 'Heterosexual', 'Homosexual', 'Bisexual', 'Other', 'NA'). (I also have some NAs that I will impute after I will have reshaped my data frame).

I want to transform this dataset into a dataset that only have binary values. So, for example, I want the above column to be divided into 4 different columns that say:

Heterosexual    Homosexual    Bisexual     Other
1               0             0            0

or, if I have an NA row, I want it to be represented as following:

Heterosexual    Homosexual    Bisexual     Other
NA              NA            NA           NA

Also, I have 'binary' variables like "Gender" (I only have the values 'Male' and 'Female') and I want this column to be split into two different columns like this:

Male   Female
0      1

or, in the case of NA:

Male   Female
NA     NA

Is there a function that I can use to do this? My professor told me that the function 'reshape' could help me doing it but I have had some troubles using it and I don't think it would work.

Could you please give me any advice? thank you in advance

pogibas
  • 27,303
  • 19
  • 84
  • 117
Msms
  • 3
  • 1
  • 1
    Welcome to stackoverflow! Can you edit your question according to [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so that other users can help you (try using `dput` function). – pogibas Apr 20 '19 at 10:07
  • Hi, I modified it but I haven't included any code (because it is not necessary for my question). I hope it's okay like this. – Msms Apr 20 '19 at 10:14

1 Answers1

0

What you're trying to create are called dummy variables, an in R those are created using model.matrix(). Your specific application is a little special however, so some extra fiddling is required.

dtf <- data.frame(id=20:24, 
                  f=c("a", "b", "c", "a", "b"), 
                  g=c("A", "C", NA, "B", "A"),
                  h=c("P", "R", "Q", NA, "Q"))

# (the first column is not a categorical variable, hence not included)
dtf2 <- dtf[-1]

# Pre-allocate a list of the appropriate length
l <- vector("list", ncol(dtf2))

# Loop over each column in dtf2 and 
for (j in 1:ncol(dtf2)) {
    # Make sure to include NA as a level 
    data <- dtf2[j]
    data[] <- factor(dtf2[,j], exclude=NULL)

    # Generate contrasts that include all levels
    cont <- contrasts(data[[1]], contrasts=FALSE)

    # Create dummy variables using the above contrasts, excluding intercept
    # Formula syntax is the same as in e.g. lm(), except the response
    # variable (term to the left of ~) is not included. 
    # '-1' means no intercept, '.' means all variables
    modmat <- model.matrix(~ -1+., data=data, contrasts.arg=cont)

    # Find rows to fill with NA
    nacols <- grep(".*NA$", colnames(modmat))

    # Only do the operations if an NA-column was found
    if (length(nacols > 0)) {
       narows <- rowSums(modmat[, nacols, drop=FALSE]) > 0
       modmat[narows,] <- NA
       modmat <- modmat[,-nacols]
    }

    l[[j]] <- modmat
}

data.frame(dtf[1], do.call(cbind, l))
#   id fa fb fc gA gB gC hP hQ hR
# 1 20  1  0  0  1  0  0  1  0  0
# 2 21  0  1  0  0  0  1  0  0  1
# 3 22  0  0  1 NA NA NA  0  1  0
# 4 23  1  0  0  0  1  0 NA NA NA
# 5 24  0  1  0  1  0  0  0  1  0
AkselA
  • 8,153
  • 2
  • 21
  • 34
  • Thank you! I have some questions (many of them could sound stupid cause I am new to programming): 1) Why do you use dtf[-1] instead of just dtf? 2) What do you mean by contrasts? 3) when you define modmat, I don't get what is '~ -1+f+g'. If I have 77 columns, do I have to write all of them like this '~ gender+sexual_orientation+....'? Thank you! – Msms Apr 20 '19 at 11:24
  • @Msms: __1__: `dtf[-1]` because I decided the first column isn't a categorical variable. It might be a numeric variable, or a response variable (DV). Real data is often mixed like this. __2__: Contrasts is a pretty big subject, but in this case it simply specifies which dummies to include. [Further reading](https://stats.stackexchange.com/questions/78354/). __3__: I added an alternative method to the example. `-1` means no intercept, `.` means all variables, additively. `.^2` will give you all interactions as well. – AkselA Apr 20 '19 at 12:13
  • Look at `?formula` for more details on how to specify model formulae. – AkselA Apr 20 '19 at 12:23
  • Thank you so much! Do you also have suggestions on how to add the rest of my data frame back to the new dataset? For example, in you example, how do you add back the column 'i' in the right place? – Msms Apr 20 '19 at 19:00
  • @Msms: `cbind()` should suffice, but `data.frame()` is safer if you have mixed classes. I added it to the example. – AkselA Apr 20 '19 at 19:43
  • @Msms: To get exactly what you want I don't think there is a way around looping over each column, do the NA manipulations, and then stick everything together again. – AkselA Apr 21 '19 at 22:22