Extract numeric part of strings of mixed numbers and characters in R

Question

I have a lot of strings, each of which tends to have the following format: Ab_Cd-001234.txt. I want to replace each one with 001234. How can I achieve this in R?

score 41 · Answer 1 · answered Mar 17 '13 at 03:35

The stringr package has lots of handy shortcuts for this kind of work:

# input data following @agstudy
data <- c('Ab_Cd-001234.txt', 'Ab_Cd-001234.txt')

# load library
library(stringr)

# prepare regular expression
regexp <- "[[:digit:]]+"

# process string
str_extract(data, regexp)

Which gives the desired result:

  [1] "001234" "001234"

To explain the regexp a little:

[[:digit:]] matches any single digit, 0 to 9

+ means the preceding item (in this case, a digit) will be matched one or more times
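
For readers without stringr installed, the same pattern can be tried in base R. This is only a sketch: the vector x and the names first_run and has_digit are made up for illustration. It assumes you want the first run of digits in each string, which is also what str_extract returns.

```r
# Hypothetical sample input; only the first element comes from the question.
x <- c("Ab_Cd-001234.txt", "no digits here", "a1b22c333")

# regexpr() locates the first match of "one or more consecutive digits";
# regmatches() returns the matched text. Elements without a match are dropped.
first_run <- regmatches(x, regexpr("[[:digit:]]+", x))
print(first_run)   # "001234" "1"

# grepl() reports which elements contain at least one digit.
has_digit <- grepl("[[:digit:]]", x)
print(has_digit)   # TRUE FALSE TRUE
```

Note that the result is shorter than the input when some strings contain no digits, whereas str_extract keeps one element per input and fills in NA.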

This page is also very useful for this kind of string processing: http://en.wikibooks.org/wiki/R_Programming/Text_Processing

answered Mar 17 '13 at 03:35

Ben

This method doesn't handle commas correctly: str_extract(c("$555","$6,077"), regexp) returns "555" "6". – MatthewR Nov 17 '16 at 21:21
Where z is your vector, I added this first: z <- gsub(",", "", z) and this worked for me! – MatthewR Nov 17 '16 at 21:27
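
The workaround from the comments can be sketched in base R as follows. It assumes every comma is a thousands separator and can safely be dropped, which may not hold for your data. The names z_clean and amounts are illustrative.

```r
# Values from the comment above.
z <- c("$555", "$6,077")

# Remove the commas first (fixed = TRUE treats "," as a literal, not a regex).
z_clean <- gsub(",", "", z, fixed = TRUE)

# Then extract the first run of digits from each cleaned string.
amounts <- regmatches(z_clean, regexpr("[0-9]+", z_clean))
print(amounts)   # "555" "6077"
```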

agstudy · Accepted Answer · 2013-03-16T16:10:20.520

30

Using gsub or sub you can do this:

gsub('.*-([0-9]+).*','\\1','Ab_Cd-001234.txt')
"001234"

You can also use gregexpr with regmatches:

m <- gregexpr('[0-9]+','Ab_Cd-001234.txt')
regmatches('Ab_Cd-001234.txt',m)
[[1]]
[1] "001234"

EDIT: Both methods are vectorized and work for a vector of strings.

x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
sub('.*-([0-9]+).*','\\1',x)
"001234" "001234"

m <- gregexpr('[0-9]+',x)
regmatches(x,m)
[[1]]
[1] "001234"

[[2]]
[1] "001234"
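
Because regmatches returns a list when used with gregexpr, you may want to flatten it to a character vector. One possible way, assuming you only need the first run of digits per string (the third input element and the helper name first_match are added here for illustration):

```r
x <- c('Ab_Cd-001234.txt', 'Ab_Cd-001234.txt', 'no_digits.txt')

m <- gregexpr('[0-9]+', x)
matches <- regmatches(x, m)   # a list with one character vector per input string

# Take the first match of each element, or NA when there is no match at all.
first_match <- vapply(
  matches,
  function(v) if (length(v) > 0) v[1] else NA_character_,
  character(1)
)
print(first_match)   # "001234" "001234" NA
```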

edited Mar 16 '13 at 16:10

answered Mar 16 '13 at 15:57

agstudy

in your first solution, what does the '\\1' do in gsub? – user288609 Mar 16 '13 at 16:29
\\1 refers to the first capture group, that is, the text matched by the part of the pattern between parentheses. – agstudy Mar 16 '13 at 16:41
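
A small sketch may make the backreference clearer. The second pattern, with two groups, is not from the answer. It is added here only to show that \\2 refers to the second pair of parentheses.

```r
x <- "Ab_Cd-001234.txt"

# One capture group: the whole string is matched and replaced by group 1.
digits <- sub(".*-([0-9]+).*", "\\1", x)
print(digits)    # "001234"

# Two capture groups: \\1 is the prefix, \\2 is the run of digits.
swapped <- sub("^(.*)-([0-9]+)\\.txt$", "\\2-\\1", x)
print(swapped)   # "001234-Ab_Cd"

# If the pattern does not match, sub() returns the input unchanged.
unchanged <- sub(".*-([0-9]+).*", "\\1", "nodigits")
print(unchanged) # "nodigits"
```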

Tyler Rinker · Answer 3 · 2013-03-16T16:21:58.057

4

You could use genXtract from the qdap package. This takes a left character string and a right character string and extracts the text between them.

library(qdap)
genXtract("Ab_Cd-001234.txt", "-", ".txt")

Though I much prefer agstudy's answer.

EDIT: Extending the answer to match agstudy's:

x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
genXtract(x, "-", ".txt")

# $`-  :  .txt1`
# [1] "001234"
# 
# $`-  :  .txt2`
# [1] "001234"

edited Mar 16 '13 at 16:21

answered Mar 16 '13 at 16:05

Tyler Rinker

+1. always like learning of new packages. (eh, "new" to me, as the saying goes) – Ricardo Saporta Mar 16 '13 at 16:16

G. Grothendieck · Answer 4 · 2013-03-21T14:08:58.547

gsub: Remove the prefix and suffix:

gsub(".*-|\\.txt$", "", x)

tools package: Use file_path_sans_ext from tools to remove the extension, then use sub to remove the prefix:

library(tools)
sub(".*-", "", file_path_sans_ext(x))

strapplyc: Extract the digits after - and before the dot. See the gsubfn home page for more info:

library(gsubfn)
strapplyc(x, "-(\\d+)\\.", simplify = TRUE)

Note that if it were desired to return a numeric we could use strapply rather than strapplyc like this:

strapply(x, "-(\\d+)\\.", as.numeric, simplify = TRUE)
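
If gsubfn is not available, a rough base R equivalent is to extract the digits with sub and convert them with as.numeric. This is a sketch that assumes every string contains a hyphen followed by digits and a dot. Strings that do not match are returned unchanged by sub and would become NA with a warning.

```r
x <- c('Ab_Cd-001234.txt', 'Ab_Cd-001234.txt')

# Keep only the digits between the "-" and the ".".
digits <- sub(".*-([0-9]+)\\..*", "\\1", x)
print(digits)   # "001234" "001234"

# Converting to numeric drops the leading zeros.
nums <- as.numeric(digits)
print(nums)     # 1234 1234
```

Keep the character version if the leading zeros are meaningful, for example when the number is an identifier rather than a quantity.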

score 1 · Answer 5 · answered Jun 21 '21 at 18:16

I'm adding this answer because it works regardless of which non-numeric characters appear in the strings you want to clean up, and because OP said that the string tends to follow the format "Ab_Cd-001234.txt", which I take to mean that the format can vary.

Note that this answer takes all numeric characters from the string and keeps them together, so if the string were "4_Ab_Cd_001234.txt", your result would be "4001234".

If you want to apply this to a column of a data frame:

df$clean_column <- gsub("[^0-9]", "", df$dirty_column)

This is very similar to the answer here: https://stackoverflow.com/a/52729957/9731173.

Essentially, my solution replaces every non-numeric character with "", while the answer I've linked to replaces every character that is not numeric, - or .
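
To see the caveat about stray digits in practice, here is a short comparison. The second approach, which keeps only the last run of digits, is an alternative sketched for contrast and is not part of this answer. The names all_digits and last_run are illustrative.

```r
x <- c("Ab_Cd-001234.txt", "4_Ab_Cd_001234.txt")

# Strip every non-digit: all digits in the string end up concatenated.
all_digits <- gsub("[^0-9]", "", x)
print(all_digits)   # "001234" "4001234"

# Alternative: find every run of digits and keep only the last one.
runs <- regmatches(x, gregexpr("[0-9]+", x))
last_run <- vapply(
  runs,
  function(v) if (length(v) > 0) v[length(v)] else NA_character_,
  character(1)
)
print(last_run)     # "001234" "001234"
```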
