3

I have a dataframe like this, where the values are separated by comma.

# Events
# A,B,C
# C,D
# B,A
# D,B,A,E
# A,E,B

I would like to have the next data frame

# Event1  Event2  Event3  Event4  Event5
# A       B       C       NA      NA
# NA      NA      C       NA      NA
# A       B       NA      NA      NA
# A       B       NA      D       E
# A       B       NA      NA      E

I have tried with cSplit but I don't have the desired df. Is possible?

NOTE: The values doesn't appear in the same possition as the variable Event in the second dataframe.

  • 1
    Does this answer your question? [Split strings by commas into columns](https://stackoverflow.com/questions/70501700/split-strings-by-commas-into-columns) – SamR Jul 11 '22 at 15:14
  • Thanks @SamR but it doesn't. The difference is that my dataframe needs to be ordered. – Kevin Santos Jul 11 '22 at 15:25

4 Answers4

3

1) Here is a base R solution. split each row giving list s and create cols which contains the possible values. Then iterate over s and convert that to a data frame.

Note that this does not hard code the column names and continues to work even if some column names are substrings of other column names.

s <- strsplit(DF$Events, ",")
cols <- unique(sort(unlist(s)))

data.frame(Event = t(sapply(s, function(x) ifelse(cols %in% x, cols, NA))))

giving:

  Event.1 Event.2 Event.3 Event.4 Event.5
1       A       B       C    <NA>    <NA>
2    <NA>    <NA>       C       D    <NA>
3       A       B    <NA>    <NA>    <NA>
4       A       B    <NA>       D       E
5       A       B    <NA>    <NA>       E

2) This base R solution uses strsplit as above and then names the components since stack requires a named list and then invokes stack. Then we expand that into a wide form using tapply and convert it to a data frame and fix up the names.

s <- strsplit(DF$Events, ",")
names(s) <- seq_along(s)
stk <- stack(s)

mat <- t(tapply(stk$values, stk, c))
colnames(mat) <- NULL
data.frame(Event = mat)

giving:

  Event.1 Event.2 Event.3 Event.4 Event.5
1       A       B       C    <NA>    <NA>
2    <NA>    <NA>       C       D    <NA>
3       A       B    <NA>    <NA>    <NA>
4       A       B    <NA>       D       E
5       A       B    <NA>    <NA>       E

This could also be represented as an R 4.2+ pipeline:

DF |>
  with(setNames(Events, seq_along(Events))) |>
  strsplit(",") |>
  stack() |>
  with(tapply(values, data.frame(ind, values), c)) |>
  `colnames<-`(NULL) |>
  data.frame(Event = _)

Note

The input in reproducible form:

Lines <- "Events
A,B,C
C,D
B,A
D,B,A,E
A,E,B"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
1

Another approach using tidyverse:

library(dplyr)
library(purrr)
library(stringr)

Events = c("A,B,C", 'C,D', "B,A", "D,B,A,E", "A,E,B")

letters <- Events %>% str_split(",") %>% unlist() %>% unique()

df <- data.frame(Events)

df %>% 
  map2_dfc(.y = letters, ~ ifelse(str_detect(.x, .y), .y, NA)) %>% 
  set_names(nm = paste0("Events", 1:length(letters)))

#> # A tibble: 5 × 5
#>   Events1 Events2 Events3 Events4 Events5
#>   <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1 A       B       C       <NA>    <NA>   
#> 2 <NA>    <NA>    C       D       <NA>   
#> 3 A       B       <NA>    <NA>    <NA>   
#> 4 A       B       <NA>    D       E      
#> 5 A       B       <NA>    <NA>    E

Created on 2022-07-11 by the reprex package (v2.0.1)

shafee
  • 15,566
  • 3
  • 19
  • 47
0

This tidyverse solution is easily the most economical in terms of amount of code used:

library(tidyverse)

data.frame(Events) %>%
  # split the strings by the comma:
  mutate(Events = str_split(Events, ",")) %>%
  # unnest splitted values wider into columns:
  unnest_wider(Events, names_sep = "")
# A tibble: 5 × 4
  Events1 Events2 Events3 Events4
  <chr>   <chr>   <chr>   <chr>  
1 A       B       C       NA     
2 C       D       NA      NA     
3 B       A       NA      NA     
4 D       B       A       E      
5 A       E       B       NA 

Data:

Events = c("A,B,C", 'C,D', "B,A", "D,B,A,E", "A,E,B")
Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34
0

We can try the following base R code

> d <- t(table(stack(setNames(strsplit(df$Events, ","), 1:nrow(df)))))

> as.data.frame.matrix(`dim<-`(colnames(d)[ifelse(d > 0, d * col(d), NA)], dim(d)))
    V1   V2   V3   V4   V5
1    A    B    C <NA> <NA>
2 <NA> <NA>    C    D <NA>
3    A    B <NA> <NA> <NA>
4    A    B <NA>    D    E
5    A    B <NA> <NA>    E
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81