R tidyr: use separate function to separate character column with comma-separated text into multiple columns using RegEx

Question

I have the following dataframe

df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three"))

It looks like this

                x
1             one
2        one, two
3      two, three
4 one, two, three

I would like to be able to separate this x column into many different columns, one for every distinct word in the column x. Basically I would like the final result to be something like this

    one  two  three
1    1    0     0
2    1    1     0
3    0    1     1
4    1    1     1

I think that in order to obtain that dataframe, I probably need to be able to use the separate function provided by tidyr and documented here. However, this requires knowledge of regular expressions, and I'm not good with them. Can anyone help me obtain this dataframe?

IMPORTANT: I do not know the number, nor the spelling of the words a priori.

Important Example

It should work also with empty strings. For instance if we have

df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three", ""))

then it should also work.

Try `library(splitstackshape); cSplit_e(df, split.col = "x", fixed = TRUE, type = "character", drop = TRUE, fill = 0L)` — markus, Dec 28 '19 at 18:34
Possible dupe: [R: Split Variable Column into multiple (unbalanced) columns by comma](https://stackoverflow.com/questions/31577423/r-split-variable-column-into-multiple-unbalanced-columns-by-comma) — markus, Dec 28 '19 at 18:38

ThomasIsCoding · Answer 1 · 2019-12-28T20:57:08.460

2

Here is a base R solution

# split strings by ", " and save in to a list `lst`
lst <- apply(df, 1, function(x) unlist(strsplit(x,", ")))

# a common set including all distinct words
common <- Reduce(union,lst)

# generate matrix which is obtained by checking if `common` can be found in the array in `lst`
dfout <- `names<-`(data.frame(Reduce(rbind,lapply(lst, function(x) +(common %in% x))),row.names = NULL),common)

such that

> dfout
  one two three
1   1   0     0
2   1   1     0
3   0   1     1
4   1   1     1

edited Dec 28 '19 at 20:57

answered Dec 28 '19 at 20:49

ThomasIsCoding

96,636
9
24
81

Thank you! It seems to be working. Can you comment it a bit? – Euler_Salter Dec 28 '19 at 20:51
@Euler_Salter Yes, I added some comments, please see my updates – ThomasIsCoding Dec 28 '19 at 20:57
Can you explain `names<-`? Also I am having a hard time understanding `Reduce` from `?Reduce`. Thanks! – its.me.adam May 20 '20 at 17:10
1

@its.me.adam `names<-(x,val)` assign names to the object, which is equivalent to `names(x) <- vals`. for `Reduce`, maybe you can get more information from https://blog.zhaw.ch/datascience/r-reduce-applys-lesser-known-brother/ – ThomasIsCoding May 21 '20 at 12:02

akrun · Accepted Answer · 2019-12-28T19:00:21.640

1

With tidyverse, we can use separate_rows to split up the 'x' column, create a sequence column and use pivot_wider from tidyr

library(dplyr)
library(tidyr)
df %>% 
   filter(!(is.na(x)|x==""))%>% 
   mutate(rn = row_number()) %>% 
   separate_rows(x) %>%
   mutate(i1 = 1) %>% 
   pivot_wider(names_from = x, values_from = i1, , values_fill = list(i1 = 0)) %>%
   select(-rn)
# A tibble: 4 x 3
#    one   two three
#  <dbl> <dbl> <dbl>
#1     1     0     0
#2     1     1     0
#3     0     1     1
#4     1     1     1

In the above code, the rn column was added to have distinct identifier for each rows after we expand the rows with separate_rows, otherwise, it can result in a list output column in pivot_wider when there are duplicate elements. The 'i1' with value 1 is added to be used in the values_from. Another option is to specify values_fn = length

Or we can use table after splitting the 'x' column in base R

table(stack(setNames(strsplit(as.character(df$x), ",\\s+"), seq_len(nrow(df))))[2:1])

edited Dec 28 '19 at 19:00

answered Dec 28 '19 at 18:30

akrun

874,273
37
540
662

I love the `tidyverse` solution. Would you be able to add some more explanation to it? I can see it works, but not 100% sure why. The first `mutate` creates a column with row numbers. Then `separate_rows` somehow separates the comma-delimited words into a single row for each word. Then you create a column of `1`s with `mutate`... and this is where I get lost – Euler_Salter Dec 28 '19 at 18:37
When I try this with a different dataset (containing NA values) the tidyverse solution throws an error. – Euler_Salter Dec 28 '19 at 18:49
@Euler_Salter can you please update your post with an example. I couldn't get an error with `df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three", NA))` – akrun Dec 28 '19 at 18:51
Will do in a second! – Euler_Salter Dec 28 '19 at 18:53
It actually fails with `""` rather than NA. I'll post an example – Euler_Salter Dec 28 '19 at 18:54
when `df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three", ""))` I get `Error: Column 4 must be named. Use .name_repair to specify repair.` – Euler_Salter Dec 28 '19 at 18:56
@Euler_Salter You can either convert the `""` to `NA` or use `filter` i.e. `df %>% filter(x != "") %>% mutate(rn = row_number()) %>% + separate_rows(x) %>% + mutate(i1 = 1) %>% + pivot_wider(names_from = x, values_from = i1, , values_fill = list(i1 = 0)) %>% + select(-rn)` – akrun Dec 28 '19 at 18:58
I am, there’s something failing in my real use case. I’ll get back to you as soon possible – Euler_Salter Dec 28 '19 at 19:22
Okay, I think I know where the problem is. In my example I have rows such as `"one and two, one, two"` and `"two and three, one, three"`. So I think the only tweek needed is to use `separate_rows(x, sep=",")` – Euler_Salter Dec 28 '19 at 20:17
@Euler_Salter. In that case, you can add the `sep = "\\s*[and,]\\s*")` – akrun Dec 28 '19 at 20:18
What does it do? I basically it needs to be able to work on normal strings such as `one, two` (here it would get `one` and `two`), on strings ending in a comma `one, two,` (here would get `one` and `two`) and in empty strings `""` – Euler_Salter Dec 28 '19 at 20:22
to me `one and two` is to be considered different from both `one` and from `two` – Euler_Salter Dec 28 '19 at 20:22
1

@Euler_Salter ok, in that caes, just use the previous `sep="\\s*,\\s*"` – akrun Dec 28 '19 at 20:23
1

Thank you so much for helping me out! I've asked you a lot of things, but you managed to handle all of them, well done!! – Euler_Salter Dec 28 '19 at 20:49
Would you be able to help me with a very similar problem, the main difference is how the information is extracted from the character column? – Euler_Salter Dec 28 '19 at 21:43
@Euler_Salter Sure, is it a new question – akrun Dec 28 '19 at 21:45
1

Yes, I'm about to finish writing it – Euler_Salter Dec 28 '19 at 21:48
Here it is: https://stackoverflow.com/questions/59515177/r-tidyr-regex-extract-ordered-numbers-from-character-column – Euler_Salter Dec 28 '19 at 21:55

score 1 · Answer 3 · answered Dec 28 '19 at 22:38

You could build a pattern from your columns and use it with tidyr::extract() :

library(tidyverse)
cols <- c("one","two","three")
pattern <- paste0("(",cols,")*", collapse= "(?:, )*")
df %>% 
  extract(x, into = c("one","two","three"), regex = pattern) %>%
  mutate_all(~as.numeric(!is.na(.)))
#>   one two three
#> 1   1   0     0
#> 2   1   1     0
#> 3   0   1     1
#> 4   1   1     1

R tidyr: use separate function to separate character column with comma-separated text into multiple columns using RegEx

Important Example

3 Answers3