Split concatenated column to corresponding column positions

Question

I have a data frame where a column may contain concatenated characters separated by |:

df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))

# df
#     FOO
# 1 A|B|C
# 2   A|B
# 3   B|C
# 4     A
# 5     C

I want to split the string and put the individual values into different columns:

df
#  X1 X2 X3
# 1 A  B  C
# 2 A  B
# 3    B  C
# 4 A
# 5       C

So far I tried with this example: [https://stackoverflow.com/questions/7069076/split-column-at-delimiter-in-data-frame][1] but it is not splitting the columns without repeating values, what I get there is:

df <- data.frame(do.call('rbind', strsplit(as.character(df$FOO),'|',fixed=TRUE)))

> df
  X1 X2 X3
1  A  B  C
2  A  B  A
3  B  C  B
4  A  A  A
5  C  C  C

And I also get this warning:

Warning message: In rbind(c("A", "B", "C"), c("A", "B"), c("B", "C"), "A", "C") : number of columns of result is not a multiple of vector length (arg 2)

What can I do in those cases? Preferably with base R. [1]: Split column at delimiter in data frame

Would the dataframe have values A, B, C... always ? I mean how do we decide which column a particular value should go? — Ronak Shah, Aug 16 '18 at 08:08
No, values might change but names remain the same. For example: `row1 = A|B|C` and `row22 = B|D|F`. — Biocrazy, Aug 16 '18 at 08:12
So if you have only these two rows how many columns would you have? 6 (A, B, C, D, E, F) or 5 (A, B, C, D, F) ? — Ronak Shah, Aug 16 '18 at 08:14
At the begining (in `df`) I have only one column with a changing number of rows. What it is inside the rows (`A|B|C`) always changes. Sometimes the max length can be 7 (`A|B|C|D|E|F|G`), others 1 (`A`), others 2 (`A|C`), and so on. — Biocrazy, Aug 16 '18 at 08:18
One last thing I want to tell you: When you look at `df` you see your column is of type `factor`. That's often a very unwanted thing. So you could add `,stringsAsFactors=F` into the data.frame definition. — Andre Elrico, Aug 16 '18 at 08:54
Yes, just added `as.data.frame()` and `stringsAsFactors = F` and it still works great! Thanks for your help! — Biocrazy, Aug 16 '18 at 08:55
are your values always a single letter? such as `A|C` and not `cat|dog3`? — s_baldur, Aug 16 '18 at 10:30
The accepted answer works for single letters and words. So that is not really important. — Biocrazy, Aug 16 '18 at 10:32

Andre Elrico · Accepted Answer · 2018-08-16T08:38:23.953

9

Simply do:

splt <- strsplit(as.character(df$FOO),"\\|")
all_val <- sort(unique(unlist(splt)))
t(sapply(splt,function(x){all_val[!(all_val %in% x)]<-NA;all_val}))


#     [,1] [,2] [,3]
#[1,] "A"  "B"  "C" 
#[2,] "A"  "B"  NA  
#[3,] NA   "B"  "C" 
#[4,] "A"  NA   NA  
#[5,] NA   NA   "C"

data:

df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))

Please note:

My version is base:: (no libraries needed) and general:

It would also work with:

df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))

edited Aug 16 '18 at 08:38

answered Aug 16 '18 at 08:32

Andre Elrico

10,956
6
50
69

1

Short, general and using `base::`... It works great! Thank you. – Biocrazy Aug 16 '18 at 08:42
3

It was a well-posed question and an interesting problem! – Andre Elrico Aug 16 '18 at 08:44
1

Please note the result is a `matrix` if that is problematic for you, just wrap `as.data.frame()` around it. – Andre Elrico Aug 16 '18 at 08:47

markus · Answer 2 · 2018-08-16T17:46:40.980

Overlooked that OP asked for a base R solution. Please try @AndreElrico's, @r.user.05apr's or @milan's solutions.

This can be done with cSplit_e from the splitstackshape package:

library(splitstackshape)
cSplit_e(
  data = df,
  split.col = "FOO",
  sep = "|",
  mode = "value",
  type = "character",
  fill = " ",
  drop = TRUE
)
#  FOO_A FOO_B FOO_C
#1     A     B     C
#2     A     B      
#3           B     C
#4     A            
#5                 C

Does also work in case of the following df (see OP's comment above).

(df1 <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F')))
#    FOO
#1 A|B|C
#2   A|B
#3   B|C
#4     A
#5     C
#6 B|D|F

cSplit_e(df1, "FOO", "|", "value",  "character", TRUE, fill = " ")
#  FOO_A FOO_B FOO_C FOO_D FOO_F
#1     A     B     C            
#2     A     B                  
#3           B     C            
#4     A                        
#5                 C            
#6           B           D     F

r.user.05apr · Answer 3 · 2018-08-16T12:34:10.510

4

In base R:

df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))

dummy <- strsplit(as.character(df$FOO), "[|]")
want <- data.frame(values = unlist(dummy),
                   ids = rep(1:length(dummy), unlist(lapply(dummy, length))), 
                   stringsAsFactors = FALSE)

library(reshape2)
want <- dcast(want, ids ~ values, value.var = "values", fill = " ")[, -1] # first col removed
names(want) <- paste0("X", seq_along(unique(unlist(dummy)))) 
want
# X1 X2 X3
#1  A  B  C
#2  A  B   
#3     B  C
#4  A      
#5        C

edited Aug 16 '18 at 12:34

answered Aug 16 '18 at 08:24

r.user.05apr

5,356
3
22
39

Thanks for your help, but why does this retrieves the values as rows? Is there a way to keep the columns of the `df`? – Biocrazy Aug 16 '18 at 08:28
Sorry! I just saw you used a library to "reshape" the columns. I thought they were different examples :-P – Biocrazy Aug 16 '18 at 08:30

milan · Answer 4 · 2018-08-17T05:59:47.263

2

Use unique and strsplit to find all unique values (A, B and C in this case). Use grep to search for the unique values, and return the values when there's a match or character(0) otherwise. cbind the resulting characters. Use apply and ifelse to replace character(0) with NA.

vals <- unique(unlist(sapply(a1, function(x) strsplit(x, '|', fixed = T))))

out <- NULL
for(i in vals){
  out <- cbind(out, as.character((lapply(df$FOO, function(x) grep(x, i, value=T)))))
}

apply(out, 2, function(x) ifelse(x=="character(0)", NA, x))

     [,1] [,2] [,3]
[1,] "A"  "B"  "C" 
[2,] "A"  "B"  NA  
[3,] NA   "B"  "C" 
[4,] "A"  NA   NA  
[5,] NA   NA   "C"

edited Aug 17 '18 at 05:59

answered Aug 16 '18 at 08:38

milan

4,782
2
21
39

Let's say I have 80 letters (`A`, `B`, `C`, `D`...) then do I have to use `grep` for each one of the cases? – Biocrazy Aug 16 '18 at 09:01
Changed it now so that you can easily apply it to many values. – milan Aug 16 '18 at 09:06

Roman · Answer 5 · 2018-08-16T10:18:55.347

You can try a tidyverse as well

library(tidyverse)
df %>%
  rownames_to_column() %>% 
  separate_rows(FOO, sep="[|]") %>% 
  mutate(L=factor(FOO, labels = paste0("X",1:length(unique(FOO))))) %>% 
  spread(L, FOO) %>% 
  select(-1)
    X1   X2   X3
1    A    B    C
2    A    B <NA>
3 <NA>    B    C
4    A <NA> <NA>
5 <NA> <NA>    C

It is also generally working e.g. df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F')). In addition you can set the levels e.g. B>C>A by yourself using levels = c("B", "C", "A") in the factor function in the mutate step.

Split concatenated column to corresponding column positions

5 Answers5

Linked

Related