I have a df:

Name    Letter
 1      A;B;C;D;E
 2      A;B;C;
 3      A;
 4      A;B;C;D;E

I use the following code to make a df where each Letter is split into its own column:

library(reshape2)

new_df = transform(df, taxa = colsplit(Letter, split = ";", names = c("A", "B", "C", "D", "E"))) 

When I do this I get a new df that looks like:

Name    .A   .B   .C   .D   .E
  1     A    B    C    D    E
  2     A    B    C    C    C
  3     A    A    A    A    A
  4     A    B    C    D    E

How do I make it so that missing letters are replaced not by the previous letter but by a specific designator such as "unclassified", so that

Name    .A   .B   .C   .D   .E        
   2     A    B    C    C    C

becomes:

Name    .A   .B   .C       .D       .E
   2     A    B    C    unclass  unclass
Ronak Shah
amwalker
First apply [Split a column of a data frame to multiple columns](https://stackoverflow.com/questions/4350440/split-a-column-of-a-data-frame-to-multiple-columns), then [How do I replace NA values with zeros in an R dataframe](https://stackoverflow.com/questions/8161836/how-do-i-replace-na-values-with-zeros-in-an-r-dataframe) – Ronak Shah Dec 29 '17 at 02:48
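
The two steps named in this comment (split the column, then replace the missing values) can be sketched in base R without extra packages. This is only an illustration under stated assumptions: the column names Letter_1 ... Letter_n and the object names parts, padded and new_df are hypothetical, and the data frame is rebuilt from the question's example.

```r
# Example data rebuilt from the question
df <- data.frame(
  Name = 1:4,
  Letter = c("A;B;C;D;E", "A;B;C;", "A;", "A;B;C;D;E"),
  stringsAsFactors = FALSE
)

# Step 1: split each string on ";" (strsplit drops the trailing empty piece)
parts <- strsplit(df$Letter, ";", fixed = TRUE)

# Pad every vector to the longest length; `length<-` fills the tail with NA,
# so rbind() has nothing to recycle
n_cols <- max(lengths(parts))
padded <- do.call(rbind, lapply(parts, function(p) {
  length(p) <- n_cols
  p
}))

# Step 2: replace the NA padding with the placeholder
padded[is.na(padded)] <- "unclass"

# Hypothetical column names: Letter_1 ... Letter_n
colnames(padded) <- paste0("Letter_", seq_len(n_cols))
new_df <- cbind(df["Name"], as.data.frame(padded, stringsAsFactors = FALSE))
new_df
```

Padding each piece to a common length before binding is what avoids the repeated letters shown in the question, since no row is shorter than the others when the matrix is built.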

2 Answers


We can use the cSplit function from the splitstackshape package. After that, replace NA with "unclass".

library(splitstackshape)

df2 <- cSplit(df, "Letter", sep = ";", type.convert = FALSE)

df2[is.na(df2)] <- "unclass"

df2
#    Name Letter_1 Letter_2 Letter_3 Letter_4 Letter_5
# 1:    1        A        B        C        D        E
# 2:    2        A        B        C  unclass  unclass
# 3:    3        A  unclass  unclass  unclass  unclass
# 4:    4        A        B        C        D        E

DATA

df <- read.table(text = "Name    Letter
 1      A;B;C;D;E
 2      A;B;C;
 3      A;
 4      A;B;C;D;E",
                 header = TRUE, stringsAsFactors = FALSE)
www

For a tidyverse-style approach, I offer:

library(tidyr)
library(dplyr)
library(purrr)
library(tibble)

df <- tribble(
  ~name, ~letter,
  1, "A;B;C;D;E",
  2, "A;B;C;E",
  3, "A;",
  4, "A;B;C;D;E",
  5, "D;A;C"
)

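df %>%
  # Split each string into a list-column of letters
  mutate(letter = strsplit(letter, ";")) %>%
  # One row per (name, letter) pair; the column is named explicitly,
  # which tidyr >= 1.0.0 expects
  unnest(letter) %>%
  # One column per distinct letter; letters absent from a row become NA
  spread(letter, -name) %>%
  # Keep name as text, turn NA into "unclass", otherwise use the column's own letter
  imap_dfr(~case_when(
    .y == "name" ~ as.character(.x),
    is.na(.x) ~ "unclass",
    TRUE ~ .y
  ))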

# # A tibble: 5 x 6
#   name  A     B       C       D       E      
#   <chr> <chr> <chr>   <chr>   <chr>   <chr>  
# 1 1     A     B       C       D       E      
# 2 2     A     B       C       unclass E      
# 3 3     A     unclass unclass unclass unclass
# 4 4     A     B       C       D       E      
# 5 5     A     unclass C       D       unclass

N.B. The key benefit here is that each letter stays in its own column even when the sequence has a gap or is out of order; see the rows I changed from the question's data, name == 2 with A;B;C;E and name == 5 with D;A;C.
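
The same position-respecting reshape can also be sketched with newer tidyr verbs. This is an alternative illustration, not the code above: it assumes tidyr >= 1.1.0 (for names_sort and a scalar values_fill), the helper column value and the result name wide are hypothetical, and name is left numeric rather than converted to character.

```r
library(tidyr)
library(dplyr)
library(tibble)

df <- tribble(
  ~name, ~letter,
  1, "A;B;C;D;E",
  2, "A;B;C;E",
  3, "A;",
  4, "A;B;C;D;E",
  5, "D;A;C"
)

wide <- df %>%
  # One row per (name, letter) pair
  separate_rows(letter, sep = ";") %>%
  # Guard against an empty piece left by a trailing ";"
  filter(letter != "") %>%
  # Hypothetical helper column: the cell value is the letter itself
  mutate(value = letter) %>%
  # One column per distinct letter, sorted; absent letters get "unclass"
  pivot_wider(
    names_from = letter,
    values_from = value,
    names_sort = TRUE,
    values_fill = "unclass"
  )

wide
```

Because the columns are keyed on the letter rather than on its position in the string, a row such as D;A;C still lands in columns A, C and D.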

Kevin Arseneau