I know there are some answers here about splitting a string every nth character, such as this one and this one. However, these are quite question-specific and mostly deal with a single string rather than a data frame of multiple strings.


Example data

df <- data.frame(id = 1:2, seq = c('ABCDEFGHI', 'ZABCDJHIA'))

Looks like this:

  id       seq
1  1 ABCDEFGHI
2  2 ZABCDJHIA

Splitting on every third character

I want to split the string in each row after every third character, so that the resulting data frame looks like this:

id  1   2   3
1   ABC DEF GHI
2   ZAB CDJ HIA

What I tried

I have used the splitstackshape package before to split a string at every single character, like so: df %>% cSplit('seq', sep = '', stripWhite = FALSE, type.convert = FALSE). I would love to have a similar function (or perhaps it is possible with cSplit) to split on every third character.

David Arenburg
CodeNoob
  • Related: [Chopping a string into a vector of fixed width character elements](https://stackoverflow.com/questions/2247045/chopping-a-string-into-a-vector-of-fixed-width-character-elements) – markus May 26 '19 at 17:55

2 Answers

One option is separate from tidyr, which accepts numeric split positions in sep:

library(tidyverse)
df %>%
    separate(seq, into = paste0("x", 1:3), sep = c(3, 6))
# id  x1  x2  x3
#1  1 ABC DEF GHI
#2  2 ZAB CDJ HIA

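To make this more generic, compute the split positions and column names from the string length (this assumes all strings have the same length as the first one):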

n1 <- nchar(as.character(df$seq[1])) - 3
s1 <- seq(3, n1, by = 3)
nm1 <- paste0("x", seq_len(length(s1) +1))
df %>% 
    separate(seq, into = nm1, sep = s1)

Or, in base R, use strsplit with a regex lookbehind to split the 'seq' column after every 3 characters. This returns a list, whose elements are then combined with rbind:

df[paste0("x", 1:3)] <- do.call(rbind, 
           strsplit(as.character(df$seq), "(?<=.{3})", perl = TRUE))

NOTE: It is better to avoid non-standard column names, such as names that start with a number. For that reason, an 'x' is prepended to the names.
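
As a complement to the base R approach above, here is a minimal sketch that avoids the regex and takes the chunk width as a parameter. The helper name split_every is hypothetical (it is not part of any package), and the sketch assumes every string has the same number of characters as the first one.

```r
# Hypothetical helper (not from any package): split each string in `x`
# into chunks of `width` characters and return a character matrix.
# Assumption: all strings are as long as the first one.
split_every <- function(x, width) {
  x <- as.character(x)
  starts <- seq(1, nchar(x[1]), by = width)
  k <- length(starts)
  # substring() recycles `first` and `last` over the repeated strings,
  # so each string is cut at every start position in turn
  chunks <- matrix(substring(rep(x, each = k), starts, starts + width - 1),
                   ncol = k, byrow = TRUE)
  colnames(chunks) <- paste0("x", seq_len(k))
  chunks
}

df <- data.frame(id = 1:2, seq = c("ABCDEFGHI", "ZABCDJHIA"))
res <- cbind(df["id"], split_every(df$seq, 3))
```

Building the matrix explicitly keeps the result two-dimensional even when a string yields only a single chunk.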

akrun
  • Okay that's smart ;) Thank you! Much clearer than the other answers I have read so far – CodeNoob May 26 '19 at 17:52
  • How can the tidyverse one be made more generic? Right now I have to provide both the column names and the split points – CodeNoob May 26 '19 at 18:01
  • @CodeNoob You can use `seq` to create the split points, i.e. `seq(3, nchar(seq)-3, by = 3)` – akrun May 26 '19 at 18:02

You can also split a string every n characters in base R with read.fwf (Read Fixed Width Format Files), which needs either a file or a connection.

read.fwf(file=textConnection(as.character(df$seq)), widths=c(3,3,3))

   V1  V2  V3
1 ABC DEF GHI
2 ZAB CDJ HIA
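
The read.fwf call above hard-codes three widths. A possible generalisation, sketched here under the assumption that all strings share the length of the first one, derives the widths from the string length and binds the result back to the id column. The names width, n_chunks, con, and parts are illustrative only.

```r
df <- data.frame(id = 1:2, seq = c("ABCDEFGHI", "ZABCDJHIA"))

width <- 3
# Number of full chunks, taken from the first string (assumed representative)
n_chunks <- nchar(as.character(df$seq[1])) %/% width

con <- textConnection(as.character(df$seq))
parts <- read.fwf(con,
                  widths = rep(width, n_chunks),
                  col.names = paste0("x", seq_len(n_chunks)),
                  colClasses = "character")  # passed on to read.table
close(con)  # read.fwf does not close a connection that was already open

res <- cbind(df["id"], parts)
```

Setting colClasses = "character" is a precaution so that chunks that look like numbers or logicals are not converted.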
GKi