
I have a dataframe with a mix of continuous and categorical data.

df<- data.frame(gender=c("male","female","transgender"),
                    education=c("high-school","grad-school","home-school"),
                    smoke=c("yes","no","prefer not tell"))
> print(df)
       gender   education           smoke
1        male high-school             yes
2      female grad-school              no
3 transgender home-school prefer not tell
> str(df)
'data.frame':   3 obs. of  3 variables:
 $ gender   : chr  "male" "female" "transgender"
 $ education: chr  "high-school" "grad-school" "home-school"
 $ smoke    : chr  "yes" "no" "prefer not tell"

I'm trying to recode the categorical columns as numeric codes (nominal format). My current approach is quite tedious. First, I have to convert all character variables to factors, and then recode each column by hand with plyr's revalue():

# Coerce all character columns to factors, leaving any other columns untouched
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)

library(plyr)
df$gender<- revalue(df$gender,c("male"="1","female"="2","transgender"="3"))
df$education<- revalue(df$education,c("high-school"="1","grad-school"="2","home-school"="3"))
df$smoke<- revalue(df$smoke,c("yes"="1","no"="2","prefer not tell"="3"))
> print(df)
  gender education smoke
1      1         1     1
2      2         2     2
3      3         3     3

Is there a more elegant way to approach this problem? Something in the tidyverse style would be helpful. I have already seen somewhat similar questions like 1, 2, 3. The issue with those solutions is that they are either not relevant to what I'm after or they use base R approaches like lapply() or sapply(), which I find difficult to interpret. I would also like to know whether there is an elegant tidyverse-style way to convert all character variables to factors.

mnm
  • Maybe try `df %>% mutate(across(gender:smoke,~as.numeric(.)))` if variables are factors! – Duck Sep 18 '20 at 00:48

3 Answers


Try this. Note that mutate() and across() are used twice: first to convert each variable to a factor whose levels follow the order in which the values appear (unique()), and then to extract the numeric codes with as.numeric(). Here is the code:

library(tidyverse)
#Code
df %>% mutate(across(gender:smoke,~factor(.,levels = unique(.)))) %>%
  mutate(across(gender:smoke,~as.numeric(.)))

Output:

  gender education smoke
1      1         1     1
2      2         2     2
3      3         3     3
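
The `levels = unique(.)` argument matters because `factor()` otherwise sorts its levels alphabetically. A minimal base R sketch on a made-up vector (the vector `x` is illustrative and not taken from the question) shows the difference:

```r
# Illustrative vector with a repeated value (not from the question's data)
x <- c("male", "female", "transgender", "female")

# Default: levels are sorted alphabetically, so "female" gets code 1
codes_alpha <- as.numeric(factor(x))

# levels = unique(x): levels follow order of first appearance, so "male" gets code 1
codes_seen <- as.numeric(factor(x, levels = unique(x)))

print(codes_alpha)  # 2 1 3 1
print(codes_seen)   # 1 2 3 2
```

With the question's three-row data both orderings happen to differ only for `gender` and `education`, so it is worth checking which one you actually want.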

To see which new value each original value is mapped to, you can use this:

#Code 2
df %>% summarise_all(.funs = unique) %>% pivot_longer(everything()) %>%
  arrange(name) %>%
  group_by(name) %>% mutate(Newval=1:n())

Output:

# A tibble: 9 x 3
# Groups:   name [3]
  name      value           Newval
  <chr>     <fct>            <int>
1 education high-school          1
2 education grad-school          2
3 education home-school          3
4 gender    male                 1
5 gender    female               2
6 gender    transgender          3
7 smoke     yes                  1
8 smoke     no                   2
9 smoke     prefer not tell      3

Or maybe for more control:

#Code 3
df %>% mutate(id=1:n()) %>% pivot_longer(-id) %>%
  left_join(df %>% summarise_all(.funs = unique) %>% pivot_longer(everything()) %>%
              arrange(name) %>%
              group_by(name) %>% mutate(Newval=1:n()) %>% ungroup()) %>%
  select(-value) %>%
  pivot_wider(names_from = name,values_from=Newval) %>%
  select(-id)

Output:

# A tibble: 3 x 3
  gender education smoke
   <int>     <int> <int>
1      1         1     1
2      2         2     2
3      3         3     3

If your variables are of class character, you can use this pipeline to convert them to factors, reorder the levels by first appearance, and then make them numeric:

#Code 4
df %>% 
  mutate(across(gender:smoke,~as.factor(.))) %>%
  mutate(across(gender:smoke,~factor(.,levels = unique(.)))) %>%
  mutate(across(gender:smoke,~as.numeric(.)))

Output:

  gender education smoke
1      1         1     1
2      2         2     2
3      3         3     3
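
If the column names are not known in advance, the same idea can be sketched with `where()` instead of `gender:smoke`. This assumes dplyr 1.0.0 or later; the data frame `df2` and its `age` column are hypothetical and only there to show that numeric columns are left alone:

```r
library(dplyr)

# Hypothetical mixed data frame: one numeric column plus two character columns
df2 <- data.frame(age = c(31, 45, 27),
                  gender = c("male", "female", "male"),
                  smoke = c("yes", "no", "no"),
                  stringsAsFactors = FALSE)

recoded <- df2 %>%
  # where(is.character) selects every character column, so no names are hard-coded
  mutate(across(where(is.character), ~ factor(.x, levels = unique(.x)))) %>%
  # where(is.factor) then replaces each factor with its integer codes
  mutate(across(where(is.factor), as.integer))

str(recoded)
```

Stopping after the first `mutate()` gives the "all character columns to factor" conversion on its own.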
Duck
  • I really appreciate the quick response, including the alternative solutions, but are you sure it works? I have checked the proposed solution multiple times, including restarting RStudio, and I'm not able to reproduce it. It still prints the data in categorical format and not nominal. – mnm Sep 18 '20 at 01:00
  • @mnm Hi, I have tested with your `df`. Maybe check the `str()` of `df`, are they factors or characters? I have your `df` as factor variables. Let me know how that goes! – Duck Sep 18 '20 at 01:02
  • @mnm Also, I saw that you load `plyr`. Maybe it is causing issues with `dplyr`. Try loading `plyr` first and then `dplyr` or the `tidyverse`. – Duck Sep 18 '20 at 01:05
  • It seems to work now. The problem was that the variables were of character data type. I have revised the question to include a tedious approach for converting character to factor. Is there another way to do it in the tidyverse style? – mnm Sep 18 '20 at 01:23
  • @mnm I have added an option: you can basically use `across()` with `as.factor()` and then continue with the pipeline included at the start of the post. I hope that helps! – Duck Sep 18 '20 at 01:38
  • Many thanks for the solution. It works now. I'm accepting your proposed solution because you've not only answered the question but also suggested solutions to possible follow-up questions. – mnm Sep 18 '20 at 02:32

You can convert the character and factor columns in your data to numeric, giving each distinct value a unique number based on its order of first appearance in the data.

library(dplyr)

df %>% 
  mutate(across(where(~is.character(.) | is.factor(.)), ~match(., unique(.))))

#  gender education smoke
#1      1         1     1
#2      2         2     2
#3      3         3     3
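
To see what `match(., unique(.))` does on its own, here is a minimal base R sketch on a made-up vector with repeated values (the names `x`, `lookup`, and `codes` are illustrative):

```r
# Illustrative vector with repeated values (not from the question's data)
x <- c("yes", "no", "prefer not tell", "no", "yes")

# unique(x) lists the distinct values in order of first appearance
lookup <- unique(x)  # "yes" "no" "prefer not tell"

# match() returns, for each element of x, its position in lookup
codes <- match(x, lookup)

print(codes)  # 1 2 3 2 1
```

Because no factor is created along the way, this works the same for character and factor input.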
Ronak Shah

Base R solution:

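# Recode each character/factor column to integer codes numbered by first appearance;
# df[] keeps the result as a data frame instead of a plain list
df[] <- lapply(df, function(x) {
  if (is.character(x) || is.factor(x)) {
    as.integer(factor(x, levels = unique(x)))
  } else {
    x  # leave non-categorical columns unchanged
  }
})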
hello_friend