split strings in n columns of a data frame in R

Question

I have the following data frame (actual data has a larger number of columns):

df <- data.frame(
l1=c(ind1='000000',ind2='100100'), 
l2=c(ind1='200204',ind2='124124'), 
l3=c(ind1='400204',ind2='124124'))

In R I would like to split each column into two of length 3. Column names don't matter as long as the original order is conserved. My desired output therefore is:

ind1 000 000 200 204 400 204
ind2 100 100 124 124 124 124

I did find some pointers as to how this could work so I made a function based on one of the answers found in this SO post.

splitGT <- function(x) {
  return(strsplit(x, "(?<=.{3})", perl=TRUE)[[1]])
}

While this does the splitting correctly, the result when applying it to the dataframe is an array separated by the original columns:

apply(df, c(1,2), splitGT)

, , l1

     ind1  ind2 
[1,] "000" "100"
[2,] "000" "100"

, , l2

     ind1  ind2 
[1,] "200" "124"
[2,] "204" "124"

, , l3

     ind1  ind2 
[1,] "400" "124"
[2,] "204" "124"

I managed to get past this with adply but this produced a data frame with two lines per ind and one column per original column. While this is closer to what I need I feel like I am missing something very obvious as this appears way too complicated to me.

adply(apply(df, c(1,2), splitGT), c(1, 2))

  X1   X2    l1     l2     l3
1  1 ind1    000    200    400
2  2 ind1    000    204    204
3  1 ind2    100    124    124
4  2 ind2    100    124    124

score 2 · Accepted Answer · answered Nov 21 '19 at 15:51

One option is to use str_extract_all and then bind the columns together to get your desired output. Keep in mind, the outut is a character matrix. Should be easy to work with though. Let me know if you have any issues!

library(stringr)

do.call(cbind, lapply(df, str_extract_all, ".{3}", simplify = T))
     [,1]  [,2]  [,3]  [,4]  [,5]  [,6] 
[1,] "000" "000" "200" "204" "400" "204"
[2,] "100" "100" "124" "124" "124" "124"

# or you could use `dplyr::bind_cols()` with a slight adjustment, keep in mind this 
# is now a dataframe with factor columns. But you could easily work with it, too

library(dplyr)

bind_cols(lapply(df, function(x) as.data.frame(str_extract_all(x, ".{3}", simplify = T))))
   V1  V2 V11 V21 V12 V22
1 000 000 200 204 400 204
2 100 100 124 124 124 124

This works like a charm, thanks. I just converted the output to a dataframe and got the rownames from the original dataframe. I knew the route I was taking was too complicated, I struggle quite a bit with the whole apply family of functions — Sam, Nov 21 '19 at 15:59

score 1 · Answer 2 · answered Nov 21 '19 at 16:11

Here are two ways with some tidyr reshaping. Based on the recent pivot_* upgrades, you also get nice column names based on whether a value is the first half of the string or the second half, which helps make sure things stay in order. Both should scale well to larger data frames or more complicated operations.

First version: separate all strings in "l*" columns at the non-boundary between the first 3 digits and the last 3 digits. After marking off rows, reshape to wide based on those "l*" columns.

library(dplyr)
library(tidyr)

df %>%
  tibble::rownames_to_column("id") %>%
  separate_rows(starts_with("l"), sep = "(?<=^\\d{3})\\B(?=\\d{3}$)") %>%
  group_by(id) %>%
  mutate(row = row_number())%>%
  pivot_wider(names_from = row, values_from = starts_with("l"))
#> # A tibble: 2 x 7
#> # Groups:   id [2]
#>   id    l1_1  l1_2  l2_1  l2_2  l3_1  l3_2 
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ind1  000   000   200   204   400   204  
#> 2 ind2  100   100   124   124   124   124

Second version: put all the "l*" columns into one long column and split with the same regex. Then reshape to wide with columns made by the combination of the "l*" markers and row numbers.

df %>%
  tibble::rownames_to_column("id") %>%
  pivot_longer(starts_with("l")) %>%
  separate_rows(value, sep = "(?<=^\\d{3})\\B(?=\\d{3}$)") %>%
  group_by(id, name) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = c(name, row))
# same output as previous

split strings in n columns of a data frame in R

2 Answers2