split character data into numbers and letters

Question

I have a vector of character data. Most of the elements in the vector consist of one or more letters followed by one or more numbers. I wish to split each element in the vector into the character portion and the number portion. I found a similar question on Stackoverflow.com here:

split a character from a number with multiple digits

However, the answer given above does not seem to work completely in my case or I am doing something wrong. An example vector is below:

my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")

# I can obtain the number portion using:
gsub("[^[:digit:]]", "", my.data)

# However, I cannot obtain the character portion using:
gsub("[:digit:]", "", my.data)

How can I obtain the character portion? I am using R version 2.14.1 on a Windows 7 64-bit machine.

perhaps you need to use double-`[`: `gsub("[[:digit:]]", "", my.data)` — kohske, Mar 18 '12 at 05:57

score 32 · Answer 1 · answered Dec 06 '17 at 11:20

Since none of the previous answers use tidyr::separate, here it goes:

library(tidyr)

df <- data.frame(mycol = c("APPLE348744", "BANANA77845", "OATS2647892", "EGG98586456"))

df %>%
  separate(mycol, 
           into = c("text", "num"), 
           sep = "(?<=[A-Za-z])(?=[0-9])"
           )

answered Dec 06 '17 at 11:20

meriops

This is awesome, can you detail the regular expression you are using? I was trying to use "[0-9]" but obviously it removes all the numbers after the letters – Matias Andina Dec 04 '18 at 21:55

`?<=` is a "look behind": here it matches any uppercase or lowercase letter (`[A-Za-z]`) "before the cursor". `?=` is a "look ahead": it matches any digit (`[0-9]`) "after the cursor". Neither of them "moves the cursor", so together they match the position between the letters and the numbers, i.e. where we want to split. See [here](http://userguide.icu-project.org/strings/regexp) for more on ICU regular expressions. – meriops Dec 06 '18 at 07:36
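To make the "does not move the cursor" point concrete, here is a minimal base R sketch (not part of the answer above) that locates the zero-width boundary with regexpr and cuts each string there with substring. The vector name `fruit` is made up for illustration, and the sketch assumes every element contains a letter immediately followed by a digit.

```r
# Illustrative input, reusing the values from the answer's data frame
fruit <- c("APPLE348744", "BANANA77845", "OATS2647892", "EGG98586456")

# perl = TRUE is needed for lookarounds in base R.
# The match is zero-width: regexpr() reports where it starts and a length of 0.
m <- regexpr("(?<=[A-Za-z])(?=[0-9])", fruit, perl = TRUE)

pos <- as.integer(m)          # index of the first digit in each string
len <- attr(m, "match.length") # 0 for every element: nothing is consumed

# Cut each string at the boundary (assumes a boundary was found, i.e. pos > 0)
text <- substring(fruit, 1, pos - 1)
num  <- substring(fruit, pos)

data.frame(text, num)
```

Because the match has length zero, no character is lost on either side of the cut, which is the property the separate call relies on.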

score 25 · Accepted Answer · answered Mar 18 '12 at 05:57

For your regex you have to use:

gsub("[[:digit:]]","",my.data)

The [:digit:] character class only makes sense inside a set of [].
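As a sketch of how the corrected pattern fits the question's data (the variable names `letters.part` and `digits.part` are my own), both portions can be pulled out with the bracketed class. My understanding is that without the outer brackets, `"[:digit:]"` is read as an ordinary bracket expression listing the characters `:`, `d`, `i`, `g` and `t`, which would explain why the original call removed letters such as `d` instead of digits.

```r
my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")

# [[:digit:]]  : the outer [] is the bracket expression, the inner [:digit:] is the class
# [^[:digit:]] : the same class, negated
letters.part <- gsub("[[:digit:]]", "", my.data)
digits.part  <- gsub("[^[:digit:]]", "", my.data)

letters.part
# "aaa" "b" "b" "b" "b" "ccc" "ddd" "ccc" "ddd"
digits.part
# "" "11" "21" "101" "111" "1" "1" "20" "13"

# Elements without digits give "", which as.numeric() turns into NA
digits.num <- as.numeric(digits.part)
```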

answered Mar 18 '12 at 05:57

mathematical.coffee

score 19 · Answer 3 · edited May 23 '17 at 12:02

With stringr, if you like (and slightly different from the answer to the other question):

# load library
library(stringr)
#
# load data
my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")
#
# extract numbers only
my.data.num <- as.numeric(str_extract(my.data, "[0-9]+"))
#
# check output
my.data.num
[1]  NA  11  21 101 111   1   1  20  13
#
# extract characters only
my.data.cha <- str_extract(my.data, "[A-Za-z]+")
# 
# check output
my.data.cha
[1] "aaa" "b"   "b"   "b"   "b"   "ccc" "ddd" "ccc" "ddd"

Tim Biegeleisen · Answer 4 · 2017-12-06T08:52:28.743

Late answer, but another option is to use strsplit with a regex pattern which uses lookarounds to find the boundary between numbers and letters:

var <- "ABC123"
strsplit(var, "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])", perl=TRUE)
[[1]]
[1] "ABC" "123"

The above pattern will match (but not consume) when either the previous character is a letter and the following character is a number, or vice-versa. Note that we use strsplit in Perl mode to access lookarounds.
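The same pattern can be applied to a whole character vector, since strsplit is vectorized and returns one list element per input string. The sketch below is an illustration rather than part of the answer; the names `boundary` and `pieces` are hypothetical.

```r
# Boundary between a letter and a digit, in either order
boundary <- "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])"

my.data <- c("aaa", "b11", "ccc20")
pieces <- strsplit(my.data, boundary, perl = TRUE)

# One list element per input; strings with no boundary come back unsplit
pieces[[1]]  # "aaa"
pieces[[2]]  # "b" "11"
pieces[[3]]  # "ccc" "20"

# Number of pieces per element, handy for spotting inputs that did not split
n.pieces <- lengths(pieces)
```

Checking `n.pieces` before indexing into the list avoids assuming that every element has both a letter part and a number part.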

score 6 · Answer 5 · answered Nov 27 '17 at 17:05

A slightly more elegant way (without any external packages):

> x = c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")
> gsub('\\D','', x)       # replaces non-digits with empty strings
[1] ""    "11"  "21"  "101" "111" "1"   "1"   "20"  "13" 
> gsub('\\d','', x)       # replaces digits with empty strings
[1] "aaa" "b"   "b"   "b"   "b"   "ccc" "ddd" "ccc" "ddd"

score 1 · Answer 6 · answered Nov 27 '17 at 17:40

You can also use colsplit from reshape2 to split your vector into character and digit columns in one step:

library(reshape2)

colsplit(my.data, "(?<=\\p{L})(?=\\d)", c("char", "digit"))

Result:

  char digit
1  aaa    NA
2    b    11
3    b    21
4    b   101
5    b   111
6  ccc     1
7  ddd     1
8  ccc    20
9  ddd    13

Data:

my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")

score 0 · Answer 7 · edited May 02 '20 at 06:33

mydata.nub <- gsub("\\D", "", my.data)

mydata.text <- gsub("\\d", "", my.data)

This works well: it separates the numbers from the text even when digits appear between the letters.
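A small illustration of the interleaved case, using made-up input strings (the name `mixed` is hypothetical). Note that this approach concatenates all digits and all letters, so it does not record where each run occurred in the original string.

```r
# Hypothetical inputs with digits between the letters
mixed <- c("ab12cd34", "x1y2z3")

digits.only  <- gsub("\\D", "", mixed)   # drop every non-digit
letters.only <- gsub("\\d", "", mixed)   # drop every digit

digits.only   # "1234" "123"
letters.only  # "abcd" "xyz"
```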

edited May 02 '20 at 06:33

David Buck

answered May 02 '20 at 05:59

sojan

Peter · Answer 8 · 2021-12-11T10:02:09.543

In case the result should be reassigned to a single split string:

var <- "foo123 bar1987"
paste(strsplit(var, "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])", perl=TRUE)[[1]], collapse = ' ')

Result:

"foo 123 bar 1987"

Or for a vectorized version where you want to reassign to a data frame:

df = data.frame(text=c("foo121", "131bar foo1516"))
res = strsplit(df$text, "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])", perl=TRUE)
df$res = sapply(res, paste, collapse=" ")

Result:

            text              res
1         foo121          foo 121
2 131bar foo1516 131 bar foo 1516
