Suppose I have a data frame that looks something like this:

df1 = structure(list(
    Name = structure(1:6, .Label = c("N1", "N2", "N3", "N4", "N5", "N6", "N7"),
                     class = "factor"),
    sector = structure(c(4L, 4L, 4L, 3L, 3L, 2L),
                       .Label = c("other stuff",
                                  "Private for-profit, 4-year or above",
                                  "Private not-for-profit, 4-year or above",
                                  "Public, 4-year or above"),
                       class = "factor"),
    flagship = c(1, 0, 0, 0, 0, 0)),
  .Names = c("Name", "sector", "flagship"),
  row.names = c(NA, 6L), class = "data.frame")

I want to create a new factor variable, "Sector". I can do it in a long way with many lines of code, but I'm sure there is a more efficient way.

Right now this is what I'm doing:

df1$PublicFlag=0
df1$PublicFlag[df1$sector=="Public, 4-year or above" & df1$flagship==1]=1
df1$Public=0
df1$Public[df1$sector=="Public, 4-year or above" & df1$flagship==0]=1
df1$PrivateNP=0
df1$PrivateNP[df1$sector=="Private not-for-profit, 4-year or above"]=1
df1$Private4P=0
df1$Private4P[df1$sector=="Private for-profit, 4-year or above"]=1

library(reshape)
df2 = melt(df1, id=c("Name", "sector", "flagship"))
df2 = df2[df2$value==1,c("Name", "sector", "flagship", "variable")]
library(plyr)
df2 = rename(df2, c("variable"="Sector"))

Thanks for the help!

Ignacio

4 Answers

It's an old post, but I often stumble across it. That's why I want to give an up-to-date answer. Version 0.5.0 of dplyr introduced a lot of useful vector functions to solve this problem.

Avoiding ifelse-nesting (and thus keeping many, many kittens alive) with case_when():

library(dplyr)

df1 %>% 
  mutate(Sector = case_when(
        sector=="Public, 4-year or above" & flagship==1 ~ "PublicFlag",
        sector=="Public, 4-year or above" & flagship==0 ~ "Public",
        # the full level name is required; "Private not-for-profit" alone never matches
        sector=="Private not-for-profit, 4-year or above" ~ "PrivateNP",
        sector=="Private for-profit, 4-year or above" ~ "Private4P"),
    Sector = factor(Sector, levels=c("Public","PublicFlag","PrivateNP","Private4P"))
  )
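
Rows that match none of the conditions, such as the unused "other stuff" level, come out of case_when() as NA. A minimal sketch of a catch-all branch follows. The data frame `toy` and the "Other" label are assumptions made up for illustration and are not part of the question:

```r
library(dplyr)

# Stand-in for df1; the "other stuff" row is added for illustration only.
toy <- data.frame(
  sector = c("Public, 4-year or above", "Public, 4-year or above",
             "Private not-for-profit, 4-year or above",
             "Private for-profit, 4-year or above", "other stuff"),
  flagship = c(1, 0, 0, 0, 0),
  stringsAsFactors = TRUE
)

toy <- toy %>%
  mutate(
    Sector = case_when(
      sector == "Public, 4-year or above" & flagship == 1 ~ "PublicFlag",
      sector == "Public, 4-year or above" & flagship == 0 ~ "Public",
      sector == "Private not-for-profit, 4-year or above" ~ "PrivateNP",
      sector == "Private for-profit, 4-year or above" ~ "Private4P",
      TRUE ~ "Other"  # hypothetical fallback label for anything unmatched
    ),
    # Fix the level order explicitly instead of relying on alphabetical sorting.
    Sector = factor(Sector, levels = c("Public", "PublicFlag", "PrivateNP",
                                       "Private4P", "Other"))
  )
```

Conditions are evaluated in order, so the `TRUE ~` branch only catches rows that no earlier condition matched.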

Generating a factor from a character (or numeric) variable with recode_factor(). Note that this recodes sector only, so it cannot separate "PublicFlag" from "Public":

df1 %>%
    mutate(Sector = recode_factor(sector,
                               "Public, 4-year or above" = "Public",
                               "Private not-for-profit, 4-year or above" = "PrivateNP",
                               "Private for-profit, 4-year or above" = "Private4P"))
MarkusN

Try:

df1$Sector <-  with(df1, c("Private4P", NA, "Public",
                 "PublicFlag")[as.numeric(factor(1+2*as.numeric(sector)+4*flagship))])



 subset(df1, !is.na(Sector))
 #  Name                              sector flagship     Sector 
 #1   N1             Public, 4-year or above        1 PublicFlag
 #2   N2             Public, 4-year or above        0     Public
 #3   N3             Public, 4-year or above        0     Public
 #6   N6 Private for-profit, 4-year or above        0  Private4P
akrun

You don't really even need dplyr:

# Each ifelse() handles one case; the innermost NA is the fallback for unmatched rows.
df1$Sector <- factor(ifelse(df1$sector=="Public, 4-year or above" & df1$flagship==1, "PublicFlag",
                       ifelse(df1$sector=="Public, 4-year or above" & df1$flagship==0, "Public",
                         ifelse(df1$sector=="Private not-for-profit, 4-year or above", "PrivateNP", 
                           ifelse(df1$sector=="Private for-profit, 4-year or above", "Private4P", NA)))))


df1

##   Name                                  sector flagship     Sector
## 1   N1                 Public, 4-year or above        1 PublicFlag
## 2   N2                 Public, 4-year or above        0     Public
## 3   N3                 Public, 4-year or above        0     Public
## 4   N4 Private not-for-profit, 4-year or above        0  PrivateNP
## 5   N5 Private not-for-profit, 4-year or above        0  PrivateNP
## 6   N6     Private for-profit, 4-year or above        0  Private4P

You can replace the NA in the innermost ifelse() with the final possible factor level if you need one.
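
If the nesting gets hard to read, a named lookup vector is one possible base R alternative. This is only a sketch: `toy`, `lookup`, `label`, and the "Other" catch-all level are names assumed for illustration, not part of the question.

```r
# Stand-in for df1; the "other stuff" row is added for illustration only.
toy <- data.frame(
  sector = c("Public, 4-year or above", "Public, 4-year or above",
             "Private not-for-profit, 4-year or above",
             "Private for-profit, 4-year or above", "other stuff"),
  flagship = c(1, 0, 0, 0, 0),
  stringsAsFactors = TRUE
)

# Named vector mapping each full sector string to its short label.
lookup <- c(
  "Public, 4-year or above"                 = "Public",
  "Private not-for-profit, 4-year or above" = "PrivateNP",
  "Private for-profit, 4-year or above"     = "Private4P"
)

# Index by name; sectors without an entry in lookup become NA.
label <- unname(lookup[as.character(toy$sector)])

# Split flagship institutions out of the public group (%in% treats NA as FALSE).
label[label %in% "Public" & toy$flagship == 1] <- "PublicFlag"

# Hypothetical catch-all level in place of the NA fallback.
label[is.na(label)] <- "Other"

toy$Sector <- factor(label, levels = c("Public", "PublicFlag", "PrivateNP",
                                       "Private4P", "Other"))
```

Adding a new sector then means adding one entry to `lookup` rather than another level of nesting.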

hrbrmstr

The selected answer did not work for a particular problem I was working on because I assigned numerical values in case_when() and tried to give character levels to that. I wanted to add what I did to solve my particular problem as an alternative in case someone might find it useful in the future.

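library(dplyr)

df1 %>% 
  mutate(Sector = case_when(
        sector=="Public, 4-year or above" & flagship==1 ~ "PublicFlag",
        sector=="Public, 4-year or above" & flagship==0 ~ "Public",
        sector=="Private not-for-profit, 4-year or above" ~ "PrivateNP",
        sector=="Private for-profit, 4-year or above" ~ "Private4P") %>%
  # factor(levels = ...) sets the level order without renaming any values;
  # as.factor() followed by structure(levels = ...) would relabel the
  # alphabetically sorted levels and mislabel the rows.
  factor(levels = c("Public","PublicFlag","PrivateNP","Private4P"))
  )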
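
For the situation described above, where case_when() returns numeric values and character levels are wanted, one option is factor() with both `levels` and `labels`. This is a hedged sketch: `toy`, the `code` column, and the 1 to 4 coding are assumptions chosen for illustration.

```r
library(dplyr)

# Stand-in for df1 with a few representative rows.
toy <- data.frame(
  sector = c("Public, 4-year or above", "Public, 4-year or above",
             "Private for-profit, 4-year or above"),
  flagship = c(1, 0, 0)
)

toy <- toy %>%
  mutate(
    # Hypothetical numeric coding: 1 = Public, 2 = PublicFlag,
    # 3 = PrivateNP, 4 = Private4P.
    code = case_when(
      sector == "Public, 4-year or above" & flagship == 0 ~ 1,
      sector == "Public, 4-year or above" & flagship == 1 ~ 2,
      sector == "Private not-for-profit, 4-year or above" ~ 3,
      sector == "Private for-profit, 4-year or above" ~ 4
    ),
    # levels lists the codes, labels gives the name for each code in the same order.
    Sector = factor(code, levels = 1:4,
                    labels = c("Public", "PublicFlag", "PrivateNP", "Private4P"))
  )
```

Pairing `levels` with `labels` keeps the code-to-name mapping explicit, so it does not depend on how the values happen to sort.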
jamesguy0121