Separate a column into multiple columns using tidyr::separate with sep=""

Question

df <- data.frame(category = c("X", "Y"), sequence = c("AAT.G", "CCG-T"), stringsAsFactors = FALSE)

df
  category sequence
1        X    AAT.G
2        Y    CCG-T

I want to split the column sequence into 5 columns, one for each character. I tried to do that with tidyr::separate, but it internally uses stringi::stri_split_regex, which doesn't accept an empty string as a separator (even though the sep argument is supposed to take a regex).

library(tidyr)
separate(df, sequence, into = paste0("V", 1:5), sep="")

Error: Values not split into 5 pieces at 1, 2
In addition: Warning messages:
1: In stringi::stri_split_regex(value, sep, n_max) :
  empty search patterns are not supported
2: In stringi::stri_split_regex(value, sep, n_max) :
  empty search patterns are not supported

Expected output looks like this:

  category V1 V2 V3 V4 V5
1        X  A  A  T  .  G
2        Y  C  C  G  -  T

akrun · Accepted Answer · 2015-03-10T04:51:30.777


You could do this with extract from tidyr, using one capture group per character:

library(tidyr)
extract(df, sequence, into=paste0('V', 1:5), '(.)(.)(.)(.)(.)')
#  category V1 V2 V3 V4 V5
#1        X  A  A  T  .  G
#2        Y  C  C  G  -  T
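
For longer strings the pattern does not have to be typed by hand. The following is a minimal sketch of building it with paste(rep("(.)", n), collapse=""), the idiom mentioned in the comments on this answer. It assumes every string has the same number of characters; the names n, pattern and res are only illustrative.

```r
library(tidyr)

df <- data.frame(category = c("X", "Y"),
                 sequence = c("AAT.G", "CCG-T"),
                 stringsAsFactors = FALSE)

# Assumption: all strings have the same length, so a single n applies.
n <- unique(nchar(df$sequence))
stopifnot(length(n) == 1)

# One capture group per character, e.g. "(.)(.)(.)(.)(.)" when n is 5.
pattern <- paste(rep("(.)", n), collapse = "")

res <- extract(df, sequence, into = paste0("V", seq_len(n)), regex = pattern)
res
```

If the strings had different lengths, rows that do not match the full pattern would come back as NA, so that case would need separate handling.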

Or insert a delimiter between the characters with gsub and pass that delimiter as sep to separate:

library(dplyr)
library(tidyr)
df %>% 
   mutate(sequence=gsub('(?<=.)(?=.)', ',', sequence, perl=TRUE)) %>% 
   separate(sequence, into=paste0('V', 1:5), sep=",")
#  category V1 V2 V3 V4 V5
#1        X  A  A  T  .  G
#2        Y  C  C  G  -  T

Or you can use cSplit from splitstackshape:

library(splitstackshape)
library(data.table)  # provides setnames(); cSplit returns a data.table
setnames(cSplit(df, 'sequence', '', stripWhite=FALSE),
             2:6, paste0('V', 1:5))[]
#   category V1 V2 V3 V4 V5
#1:        X  A  A  T  .  G
#2:        Y  C  C  G  -  T

edited Mar 10 '15 at 04:51

answered Mar 10 '15 at 04:38

akrun


It's a bit inelegant when you have many columns. I would have to use regex=paste(rep("(.)", n), collapse=""). But it does the job! Thank you! – vitor Mar 10 '15 at 04:44

@vitor Updated with a possible `separate` solution. – akrun Mar 10 '15 at 04:45

G. Grothendieck · Answer 2 · 2019-06-15T23:44:19.033


sep can also be an integer vector, in which case it is interpreted as the character positions to split at. It would be sufficient to use sep=1:4, but sep=1:5 works too and looks a bit better.

library(tidyr)
df %>% separate(sequence, into = paste0("V", 1:5), sep = 1:5)

giving:

  category V1 V2 V3 V4 V5
1        X  A  A  T  .  G
2        Y  C  C  G  -  T
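
To generalise this to n characters, the positions can be generated rather than written out. The sketch below assumes n is known in advance and follows the rule in ?separate that a numeric sep should be one element shorter than into; n and res are illustrative names.

```r
library(tidyr)

df <- data.frame(category = c("X", "Y"),
                 sequence = c("AAT.G", "CCG-T"),
                 stringsAsFactors = FALSE)

n <- 5  # assumed number of characters per string

# Split after positions 1, 2, ..., n - 1, giving n single-character pieces.
res <- separate(df, sequence,
                into = paste0("V", seq_len(n)),
                sep = seq_len(n - 1))
res
```

Unlike the regex-based approaches, this needs no pattern at all, so special characters such as "." and "-" need no escaping.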

edited Jun 15 '19 at 23:44

answered Jun 15 '19 at 23:11

G. Grothendieck


Very interesting approach! Do you have a reference that describes how it works? – yuk Aug 16 '22 at 16:18
Try the help file: `?separate` – G. Grothendieck Aug 16 '22 at 17:14
Ah, missed the last paragraph for `sep` somehow. – yuk Aug 16 '22 at 20:29
