Text to columns by fixed width in R

Question

I have a large data frame in which I'm trying to separate the values from one column into two. The values are letters followed by digits, such as AU2847 or AU1824. I want the first column to be AU and the second to be the corresponding 4-digit number.

I am also restricted to the base R packages, so I believe strsplit will be my best bet, but I can't figure out how to make it split after the 2nd character and create 2 columns from it.

[See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R question that folks can help with. That includes a sample of data and all necessary code. — camille, Feb 03 '20 at 15:16

score 0 · Accepted Answer · answered Feb 03 '20 at 15:07

I regularly use these two functions:

substrRight <- function(x, n){
  substr(x, nchar(x)-n+1, nchar(x))
}

and

substrLeft <- function(x, n){
  substr(x, 1, n)
}

These return the rightmost or leftmost n characters of each string, respectively.
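Applied to the question's data, the two helpers could be used roughly as follows. This is only a sketch: the data frame `df` and the column names `code`, `prefix`, and `number` are made up for illustration, and it assumes every value is exactly two letters followed by four digits.

```r
# Helper functions from this answer, repeated so the sketch runs on its own
substrLeft  <- function(x, n) substr(x, 1, n)
substrRight <- function(x, n) substr(x, nchar(x) - n + 1, nchar(x))

# Hypothetical sample data; column names are illustrative only
df <- data.frame(code = c("AU2847", "AU1824"), stringsAsFactors = FALSE)

# First two characters become the prefix, last four the number
df$prefix <- substrLeft(df$code, 2)
df$number <- as.integer(substrRight(df$code, 4))
df
```

Both helpers are vectorised because substr() and nchar() are, so no loop over rows is needed.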

answered Feb 03 '20 at 15:07

SebSta


arg0naut91 · Answer 2 · 2020-02-03T15:41:23.273

You could try:

as.data.frame(
  do.call(rbind,
          strsplit(sub("^(.+?)(\\d+)", "\\1_\\2", df$col),
                   split="_")
          )
  )

Here, df is the name of your data frame and col is the name of your column.

This artificially inserts an underscore between the text and the first digit, so that the underscore can be used as the split argument to strsplit(). It assumes the original values do not already contain an underscore.
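To make the two steps visible, here is a sketch that keeps the intermediate result and names the new columns. The sample data and the names `marked`, `parts`, `prefix`, and `number` are assumptions for illustration; the regular expression is the one from the answer.

```r
# Hypothetical sample data using the column name `col` from the answer
df <- data.frame(col = c("AU2847", "AU1824"), stringsAsFactors = FALSE)

# Step 1: insert an underscore between the leading text and the digits
marked <- sub("^(.+?)(\\d+)", "\\1_\\2", df$col)

# Step 2: split on the literal underscore and bind the pieces row-wise
parts <- do.call(rbind, strsplit(marked, split = "_", fixed = TRUE))

# Attach the pieces as named columns (names are illustrative)
df$prefix <- parts[, 1]
df$number <- as.integer(parts[, 2])
df
```

Using fixed = TRUE treats the underscore as a plain string rather than a regular expression, which is a small design choice of this sketch and not part of the original answer.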

edited Feb 03 '20 at 15:41

answered Feb 03 '20 at 15:15

arg0naut91


score 0 · Answer 3 · answered Feb 03 '20 at 15:25

There are several ways to do this. You can subset by position using substr(), or you can use gsub() with backreferences. Subsetting by position is faster but inflexible (you would need a huge data frame to notice the difference in time), whereas the regex approach with gsub() is a little slower but much more flexible. E.g.:

df[c("col2", "col3", "col2b", "col3b")] <- list(substr(df$col1, 1, 2),
                                                substr(df$col1, 3, 6),
                                                gsub("([[:alpha:]]+)(\\d+)", "\\1", df$col1),
                                                gsub("([[:alpha:]]+)(\\d+)", "\\2", df$col1))

df
    col1 col2 col3 col2b col3b
1 AU2847   AU 2847    AU  2847
2 AU1824   AU 1824    AU  1824

Data:

df <- data.frame(col1 = c("AU2847", "AU1824"), stringsAsFactors = FALSE)
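The flexibility trade-off mentioned above can be illustrated with values whose prefix length varies. The vector below is hypothetical (the question only shows two-letter prefixes); it is meant as a sketch of where the positional approach stops working and the regex approach still does.

```r
# Hypothetical values: the second one has a three-letter prefix
x <- c("AU2847", "NSW12")

# Fixed positions assume exactly two letters, so "NSW12" is cut in the wrong place
pos_prefix <- substr(x, 1, 2)   # "AU" "NS"
pos_number <- substr(x, 3, 6)   # "2847" "W12"

# The regex from the answer adapts to the actual letter/digit boundary
rx_prefix <- gsub("([[:alpha:]]+)(\\d+)", "\\1", x)   # "AU" "NSW"
rx_number <- gsub("([[:alpha:]]+)(\\d+)", "\\2", x)   # "2847" "12"
```

If the data really are fixed width, as in the question, both approaches give the same result.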

score 0 · Answer 4 · answered Feb 03 '20 at 23:24

We can use strsplit() together with a regular expression which uses a lookbehind assertion:

x  <- c("AU2847", "AU1824")
strsplit(x, "(?<=[A-Z]{2})", perl = TRUE)

[[1]]
[1] "AU"   "2847"

[[2]]
[1] "AU"   "1824"

The lookbehind regular expression tells strsplit() to split each string after two capital letters. There is no need to artificially introduce a character to split on as in arg0naut91's answer.

Now, the OP has mentioned that the character vector to be split is a column of a larger data.frame. This requires some additional code to append the list output of strsplit() as new columns to the data.frame.

Let's assume we have this data.frame:

DF <- data.frame(x, stringsAsFactors = FALSE)

Now, the new columns can be appended by:

DF[, c("col1", "col2")] <- do.call(rbind, strsplit(DF$x, "(?<=[A-Z]{2})", perl = TRUE))
DF

       x col1 col2
1 AU2847   AU 2847
2 AU1824   AU 1824
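On a large data frame it may be worth checking that every value really splits into two pieces before binding, and converting the digits to a number afterwards. The following is a sketch under those assumptions; the intermediate name `parts` and the integer conversion are additions for illustration, not part of the answer.

```r
x  <- c("AU2847", "AU1824")
DF <- data.frame(x, stringsAsFactors = FALSE)

# Same split as above, kept in an intermediate list
parts <- strsplit(DF$x, "(?<=[A-Z]{2})", perl = TRUE)

# Guard: rbind() would recycle shorter pieces rather than fail,
# so stop early if any value did not split into exactly two parts
stopifnot(all(lengths(parts) == 2))

DF[, c("col1", "col2")] <- do.call(rbind, parts)

# The pieces are character strings; convert the digits if a number is wanted
DF$col2 <- as.integer(DF$col2)
DF
```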
