1

I currently have a dataset in R that is in long format and I'm trying to make it wide with a couple of specifications. So my dataset has a respondent ID and their gender along with one other column (let's say "fruits") that I'm interested in.

id <- c(1,1,1,2,3,4,4,5,5,5)
gender <- c("F","F","F","M","Unknown","F","F","M","M","M")
fruit <- c("pear", "apple", "banana", "pear", "strawberry","banana", "banana", "pear", "strawberry", "banana")
df <- cbind(id, gender, fruit)

      id  gender    fruit       
 [1,] "1" "F"       "pear"      
 [2,] "1" "F"       "apple"     
 [3,] "1" "F"       "banana"    
 [4,] "2" "M"       "pear"      
 [5,] "3" "Unknown" "strawberry"
 [6,] "4" "F"       "banana"    
 [7,] "4" "F"       "banana"    
 [8,] "5" "M"       "pear"      
 [9,] "5" "M"       "strawberry"
[10,] "5" "M"       "banana" 

My goal is to create a binary column for each potential response in that column to determine whether that individual provided that response at all throughout the dataset.

id <- c(1,2,3,4,5)
gender <- c("F","M","Unknown","F","M")
pear <- c(1,1,0,0,1)
apple <- c(1,0,0,0,0)
banana <- c(1,0,0,1,1)
strawberry <- c(0,0,1,0,1)
df2 <- cbind(id, gender, pear, apple, banana, strawberry)

     id  gender    pear apple banana strawberry
[1,] "1" "F"       "1"  "1"   "1"    "0"       
[2,] "2" "M"       "1"  "0"   "0"    "0"       
[3,] "3" "Unknown" "0"  "0"   "0"    "1"       
[4,] "4" "F"       "0"  "0"   "1"    "0"       
[5,] "5" "M"       "1"  "0"   "1"    "1"  

I'm not sure if this is important to note but while in it's original long format, a specific ID can have multiple rows with the same response to "fruits" like I have shown with my mock respondent 4. I hope that's clear; thanks very much in advance!

user21027866
  • 129
  • 5

3 Answers3

3

You can use fastDummies:dummyCols:

library(dplyr) #1.1.0+ or above required
df %>% 
  summarise(fruit = toString(fruit), .by = id) %>% 
  fastDummies::dummy_cols("fruit", split = ", "
                          remove_selected_columns = TRUE)

  id fruit_banana fruit_pear fruit_apple fruit_strawberry
1  1            1          1           1                0
2  2            0          1           0                0
3  3            0          0           0                1
4  4            1          0           0                0
5  5            1          1           0                1

Alternatively, in a single pivot_wider call, you can do:

df %>% 
  pivot_wider(names_from = fruit, values_from = fruit,
              values_fn = ~ +(length(.x) > 0), values_fill = 0)
Maël
  • 45,206
  • 3
  • 29
  • 67
3

Here is a pure Tidyverse approach:

library(tidyverse)

df |>
  distinct() |>
  pivot_wider(
    names_from = fruit,
    values_from = fruit,
    values_fn = \(x) 1,
    values_fill = 0
  )
#> # A tibble: 5 × 6
#>   id    gender   pear apple banana strawberry
#>   <chr> <chr>   <dbl> <dbl>  <dbl>      <dbl>
#> 1 1     F           1     1      1          0
#> 2 2     M           1     0      0          0
#> 3 3     Unknown     0     0      0          1
#> 4 4     F           0     0      1          0
#> 5 5     M           1     0      1          1

Created on 2023-04-12 with reprex v2.0.2

dufei
  • 2,166
  • 1
  • 7
  • 18
2

Here is another approach: First we create dummy binary variables from factor fruit, then we summarise across and find the max:

library(dplyr)

df %>%
  as.data.frame() %>% 
  mutate(fruit = as.factor(fruit)) %>% 
  cbind(sapply(levels(.$fruit), `==`, .$fruit)*1) %>% 
  select(-fruit) %>% 
  summarise(across(everything(), ~max(.)), .by=c(id, gender))
  id  gender apple banana pear strawberry
1  1       F     1      1    1          0
2  2       M     0      0    1          0
3  3 Unknown     0      0    0          1
4  4       F     0      1    0          0
5  5       M     0      1    1          1
TarJae
  • 72,363
  • 6
  • 19
  • 66