Missing values with transform() and colsplit()

Question

I have a data frame, df:

Name    Letter
 1      A;B;C;D;E
 2      A;B;C;
 3      A;
 4      A;B;C;D;E

I use the following code to make a df where each Letter is split into its own column:

library(reshape2)

new_df = transform(df, taxa = colsplit(Letter, split = ";", names = c("A", "B", "C", "D", "E")))

When I do this I get a new df that looks like:

Name    .A   .B   .C   .D   .E
  1     A    B    C    D    E
  2     A    B    C    C    C
  3     A    A    A    A    A
  4     A    B    C    D    E

How do I make it so that missing letters are replaced not by the previous letter but by a specific designator like "unclassified", so that

Name    .A   .B   .C   .D   .E        
   2     A    B    C    C    C

becomes:

Name    .A   .B   .C       .D       .E
   2     A    B    C    unclass  unclass

You first do [Split a column of a data frame to multiple columns](https://stackoverflow.com/questions/4350440/split-a-column-of-a-data-frame-to-multiple-columns) and then [How do I replace NA values with zeros in an R dataframe](https://stackoverflow.com/questions/8161836/how-do-i-replace-na-values-with-zeros-in-an-r-dataframe) — Ronak Shah, Dec 29 '17 at 02:48
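The two steps in the comment above (split the column, then replace the NA values) can be sketched in base R without extra packages. This is only an illustration under assumptions: the output column names `Letter_1` to `Letter_5` and the helper variables `parts`, `width`, and `padded` are made up for this sketch, and the sample data is copied from the question.

```r
# Sample data from the question
df <- read.table(text = "Name    Letter
 1      A;B;C;D;E
 2      A;B;C;
 3      A;
 4      A;B;C;D;E",
                 header = TRUE, stringsAsFactors = FALSE)

# Step 1: split each string on ";".
# strsplit() does not return an empty piece for a trailing ";".
parts <- strsplit(df$Letter, ";", fixed = TRUE)

# Pad every vector to the longest length; `length<-` pads with NA
width <- max(lengths(parts))
padded <- do.call(rbind, lapply(parts, function(p) {
  length(p) <- width
  p
}))

# Step 2: replace the NA padding with the chosen designator
padded[is.na(padded)] <- "unclass"

# Reassemble; the Letter_* names are hypothetical, chosen for this sketch
new_df <- data.frame(Name = df$Name, padded, stringsAsFactors = FALSE)
names(new_df)[-1] <- paste0("Letter_", seq_len(width))
new_df
```

Because the padding is positional, this sketch fills from the left: it suits rows that are simply truncated, not rows with a letter missing in the middle.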

score 2 · Accepted Answer · answered Dec 29 '17 at 02:34

We can use the cSplit function from the splitstackshape package. After that, replace NA with "unclass".

library(splitstackshape)

df2 <- cSplit(df, "Letter", sep = ";", type.convert = FALSE)

df2[is.na(df2)] <- "unclass"

df2
#    Name Letter_1 Letter_2 Letter_3 Letter_4 Letter_5
# 1:    1        A        B        C        D        E
# 2:    2        A        B        C  unclass  unclass
# 3:    3        A  unclass  unclass  unclass  unclass
# 4:    4        A        B        C        D        E

DATA

df <- read.table(text = "Name    Letter
 1      A;B;C;D;E
 2      A;B;C;
 3      A;
 4      A;B;C;D;E",
                 header = TRUE, stringsAsFactors = FALSE)

Kevin Arseneau · Answer 2 · 2017-12-29T06:03:41.147

For a tidyverse-style approach, I offer:

library(tidyr)
library(dplyr)
library(purrr)
library(tibble)

df <- tribble(
  ~name, ~letter,
  1, "A;B;C;D;E",
  2, "A;B;C;E",
  3, "A;",
  4, "A;B;C;D;E",
  5, "D;A;C"
)

df %>%
  mutate(letter = strsplit(letter, ";")) %>%
  unnest(letter) %>%
  spread(letter, -name) %>%
  imap_dfr(~case_when(
    .y == "name" ~ as.character(.x),
    is.na(.x) ~ "unclass",
    TRUE ~ .y
  ))

# # A tibble: 5 x 6
#   name  A     B       C       D       E      
#   <chr> <chr> <chr>   <chr>   <chr>   <chr>  
# 1 1     A     B       C       D       E      
# 2 2     A     B       C       unclass E      
# 3 3     A     unclass unclass unclass unclass
# 4 4     A     B       C       D       E      
# 5 5     A     unclass C       D       unclass

N.B. The key benefit here is that each letter lands in its own column even when there is a gap in the sequence or the letters are out of order; see the changed values for name == 2 (A;B;C;E) and name == 5 (D;A;C).
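Since spread() has been superseded by pivot_wider() in tidyr 1.0.0 and later, the same position-respecting reshape might be written as below. This is a sketch rather than the answer's own method: the helper column `value` and the result name `wide` are assumptions introduced here, and `name` stays numeric instead of being converted to character.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Same sample data as in the answer above
df <- tribble(
  ~name, ~letter,
  1, "A;B;C;D;E",
  2, "A;B;C;E",
  3, "A;",
  4, "A;B;C;D;E",
  5, "D;A;C"
)

wide <- df %>%
  # one row per (name, letter) pair
  separate_rows(letter, sep = ";") %>%
  # a trailing ";" leaves an empty string behind; drop it
  filter(letter != "") %>%
  # keep a copy of the letter to use as the cell value (hypothetical helper column)
  mutate(value = letter) %>%
  # one column per letter; absent letters are filled with "unclass"
  pivot_wider(
    names_from = letter,
    values_from = value,
    values_fill = list(value = "unclass")
  )

wide
```

pivot_wider() creates columns in order of first appearance, so the columns come out as A to E here only because the first row contains all five letters in order.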
