How to split a data frame column with no defined delimiter

Question

How would I split this column 'seriesID' into multiple columns to look like the table below? Basically I need to split the string into multiple strings with lengths (3,3,6,1,1,3).

  seriesID
1 ISU111aaaaaa33001
2 ISU222bbbbbb33001
3 ISU000cccccc63001
4 ISU333dddddd63001


seriesID             pre  supp  ind     data  case  area
1 ISU111aaaaaa33001  ISU  111   aaaaaa  3     3     001
2 ISU222bbbbbb33001  ISU  222   bbbbbb  3     3     001
3 ISU000cccccc63001  ISU  000   cccccc  6     3     001
4 ISU333dddddd63001  ISU  333   dddddd  6     3     001

Thanks!

HubertL · Answer 1 · 2017-03-16T20:44:50.900

You can use separate from package tidyr:

df <- data.frame(series=c("ISU00000000033001","ISU00000000033001","ISU00000000063001","ISU00000000063001"), stringsAsFactors=FALSE)

library(tidyr)
df %>%
  separate(series, 
           c("pre", "supp", "ind", "data", "case", "area"), 
           sep=cumsum(c(3,3,6,1,1)))

  pre supp    ind data case area
1 ISU  000 000000    3    3  001
2 ISU  000 000000    3    3  001
3 ISU  000 000000    6    3  001
4 ISU  000 000000    6    3  001

score 2 · Answer 2 · answered Mar 16 '17 at 20:35

You can also use substr:

widths = c(3,3,6,1,1,3)
end = cumsum(widths)
start = c(1, head(end, -1) + 1)

as.data.frame(mapply(substr, start, end, MoreArgs = list(x=df$seriesID)))

#   V1  V2     V3 V4 V5  V6
#1 ISU 000 000000  3  3 001
#2 ISU 000 000000  3  3 001
#3 ISU 000 000000  6  3 001
#4 ISU 000 000000  6  3 001

score 1 · Answer 3 · answered Mar 16 '17 at 20:26

You could "re-read" your data as a fixed-wdith file using readr. For example

series=c("ISU00000000033001","ISU00000000033001","ISU00000000063001","ISU00000000063001")

read_fwf(paste(series, collapse="\n"), fwf_widths(c(3,3,6,1,1,3)))
# A tibble: 4 × 6
#      X1    X2     X3    X4    X5    X6
#   <chr> <chr>  <chr> <int> <int> <chr>
# 1   ISU   000 000000     3     3   001
# 2   ISU   000 000000     3     3   001
# 3   ISU   000 000000     6     3   001
# 4   ISU   000 000000     6     3   001

Note that we collapse the string vector into a single string with new-lines which may be inefficient for large vectors.

score 1 · Accepted Answer · answered Mar 16 '17 at 20:29

seriesID <- c('ISU00000000033001',
          'ISU00000000033001',
          'ISU00000000063001',
          'ISU00000000063001')



df <- data.frame(pre = substr(seriesID,1,3), 
             supp =substr(seriesID,4,6),
             ind =substr(seriesID,7,12),
             data =substr(seriesID,13,13),
             case =substr(seriesID,14,14),
             area =substr(seriesID,15,17))

df


pre supp    ind data case area
1 ISU  000 000000    3    3  001
2 ISU  000 000000    3    3  001
3 ISU  000 000000    6    3  001
4 ISU  000 000000    6    3  001

score 0 · Answer 5 · answered Mar 16 '17 at 20:31

It sounds like your should really be handling this when you read the data using something like read.fwf(): https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.fwf.html.

But to solve the problem posed, just use substr()

seriesID <- c('ISU00000000033001', 'ISU00000000033001', 'ISU00000000063001', 'ISU00000000063001')

df <- data.frame(seriesID = seriesID,
    pre = substr(seriesID, 1, 3),
    supp = substr(seriesID, 4, 6),
    ind = substr(seriesID, 7, 12),
    data = substr(seriesID, 13, 13),
    case = substr(seriesID, 14, 14),
    area = substr(seriesID, 15, 17))

print(df)
#            seriesID pre supp    ind data case area
# 1 ISU00000000033001 ISU  000 000000    3    3  001
# 2 ISU00000000033001 ISU  000 000000    3    3  001
# 3 ISU00000000063001 ISU  000 000000    6    3  001
# 4 ISU00000000063001 ISU  000 000000    6    3  001

How to split a data frame column with no defined delimiter

5 Answers5