0

I can't believe I'm struggling with this after just 3 years of taking a break from R...

Basically, I'm having a dataframe where in the first column all possible values are listed. The subsequent columns have either one of the values or NA. I would love to sort every subsequent column in a way so that when it contains the value at some point, it shall be in the row where that value is in the first column.

Probably easier to explain with an example:

Original:

Letters Category1 Category2 Category3 Category4
A       NA        A         NA        NA
B       A         NA        NA        D
C       NA        NA        NA        A
D       NA        C         NA        NA
E       E         B         C         NA

Desired state:

Letters Category1 Category2 Category3 Category4
A       A         A         NA        A
B       NA        B         NA        NA
C       NA        C         C         NA
D       NA        NA        NA        D
E       E         NA        NA        NA

Am I overlooking a built-in function / library to do this elegantly? My approach would probably to build a function that creates a new dataframe and checks the contents row by row, but this seems very inefficient...

  • `cbind(Letters = df$Letters, apply(df[, startsWith(names(df), "Category")], 2, function(x) { y <- df$Letters; y[setdiff(seq_len(length(x)), sort(match(x, y)))] <- NA_character_; y }))` – hello_friend Dec 01 '20 at 22:13

2 Answers2

1

use tidyverse

df <- read.table(text = "Letters Category1 Category2 Category3 Category4
A       NA        A         NA        NA
B       A         NA        NA        D
C       NA        NA        NA        A
D       NA        C         NA        NA
E       E         B         C         NA", header = T)

library(tidyverse)
df %>% 
  pivot_longer(-Letters, values_drop_na = T) %>% 
  mutate(Letters = value) %>% 
  arrange(name) %>% 
  pivot_wider(Letters, names_from = name, values_from = value)

#> # A tibble: 5 x 5
#>   Letters Category1 Category2 Category3 Category4
#>   <chr>   <chr>     <chr>     <chr>     <chr>    
#> 1 A       A         A         <NA>      A        
#> 2 E       E         <NA>      <NA>      <NA>     
#> 3 C       <NA>      C         C         <NA>     
#> 4 B       <NA>      B         <NA>      <NA>     
#> 5 D       <NA>      <NA>      <NA>      D

Created on 2020-12-01 by the reprex package (v0.3.0)

use data.table

library(data.table)
dt <- as.data.table(df)
dt_long <- melt(data = dt, id.vars = "Letters", na.rm = T)
dcast(dt_long, value ~ variable, value.var = "value")

#>    value Category1 Category2 Category3 Category4
#> 1:     A         A         A      <NA>         A
#> 2:     B      <NA>         B      <NA>      <NA>
#> 3:     C      <NA>         C         C      <NA>
#> 4:     D      <NA>      <NA>      <NA>         D
#> 5:     E         E      <NA>      <NA>      <NA>

Created on 2020-12-01 by the reprex package (v0.3.0)

Yuriy Saraykin
  • 8,390
  • 1
  • 7
  • 14
0

One way using dplyr and tidyr would be :

library(dplyr)
library(tidyr)


vals <- df$Letters

df %>%
  pivot_longer(cols = starts_with('Category'), 
               values_drop_na = TRUE) %>%
  group_by(value) %>%
  mutate(Letters = vals[cur_group_id()]) %>%
  arrange(name) %>%
  pivot_wider() %>%
  arrange(Letters) -> result

result

# Letters Category1 Category2 Category3 Category4
#  <chr>   <chr>     <chr>     <chr>     <chr>    
#1 A       A         A         NA        A        
#2 B       NA        B         NA        NA       
#3 C       NA        C         C         NA       
#4 D       NA        NA        NA        D        
#5 E       E         NA        NA        NA        
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213