How to extract all numbers in a string as a vector

Question

Is there any way to extract all numbers in a string as a vector? I have a large dataset that doesn't follow any specific pattern, so a single extract + regex pattern won't necessarily capture every number. For example, for each row of the data frame shown below:

x <- c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY", 
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE", 
"3.2 - $100000")

[1] "3.2% 1ST $100000 AND 1.1% BALANCE"                                
[2] "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY"
[3] "$4000"                                                            
[4] "3.3% 1ST $100000 AND 1.2% BALANCE"                                
[5] "3.3% 1ST $100000 AND 1.2% BALANCE"                                
[6] "3.2 - $100000"

I want to have an output like:

[1] "3.2 100000 1.1"                                
[2] "3.3 100000 1.2 3000"
[3] "4000"                                                            
[4] "3.3 100000 1.2 "                                
[5] "3.3 100000 1.2 "                                
[6] "3.2 100000 "

I looked through some resources and found this link: https://statisticsglobe.com/extract-numbers-from-character-string-vector-in-r

regmatches(x, gregexpr("[[:digit:]]+", x))

The function above seems to work, but it cannot handle all kinds of numbers at once. I understand that "[[:digit:]]+" only looks for integers, but how can we change it so that it covers all kinds of numbers (integers and decimals)?

See [Regular expression for floating point numbers](https://stackoverflow.com/questions/12643009/regular-expression-for-floating-point-numbers) — Wiktor Stribiżew, Nov 12 '20 at 22:23

akrun · Accepted Answer · 2020-11-12T22:06:16.447

3

We also need to allow "." in the character class. The word boundaries (\\b) keep the pattern from picking up the 1 in 1ST.

sapply(regmatches(x, gregexpr("\\b[[:digit:].]+\\b", x)), paste, collapse= ' ')
#[1] "3.2 100000 1.1"    
#[2] "3.3 100000 1.2 3000" 
#[3] "4000"              
#[4] "3.3 100000 1.2"   
#[5] "3.3 100000 1.2"     
#[6] "3.2 100000"
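If the goal is an actual numeric vector per string rather than one pasted string, the matches can be passed to as.numeric instead of paste. A minimal sketch, assuming x is the character vector from the question; the names matches and nums are only illustrative:

```r
# Sample input from the question
x <- c("3.2% 1ST $100000 AND 1.1% BALANCE",
       "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY",
       "$4000",
       "3.3% 1ST $100000 AND 1.2% BALANCE",
       "3.3% 1ST $100000 AND 1.2% BALANCE",
       "3.2 - $100000")

# Same pattern as above; regmatches() returns one character vector per string
matches <- regmatches(x, gregexpr("\\b[[:digit:].]+\\b", x))

# Convert each character vector to a numeric vector (result is a list)
nums <- lapply(matches, as.numeric)
```

This assumes every match is a well-formed number. A token such as 1.2.3 would also match the character class and would become NA (with a warning) in as.numeric.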

edited Nov 12 '20 at 22:06

answered Nov 12 '20 at 22:02

akrun


Thanks @akrun, but then it extracts 1 in 1ST. I am only looking for pure numbers – Ross_you Nov 12 '20 at 22:04
@Roozbeh_you sorry, updated with word boundary. Can you please check – akrun Nov 12 '20 at 22:07

score 3 · Answer 2 · answered Nov 12 '20 at 22:23


akrun's answer is perfect, but here is another solution that uses rebus, a package I recently found for building regular expression patterns.

library(stringr)
library(rebus)
library(magrittr)

pattern = one_or_more(DIGIT) %R% optional(DOT) %R% optional(one_or_more(DIGIT))

str_remove(x, "1ST") %>% 
  str_match_all(pattern = pattern) %>% 
  lapply(function(x) paste(as.vector(x), collapse = " ")) %>% 
  unlist()


Johan Rosa


Thanks Johan. Your answer is also correct; however, not all strings contain **1ST**; some contain, for instance, **1-ST**. That could cause issues with `str_remove`, I guess, right? – Ross_you Nov 12 '20 at 23:54

score 1 · Answer 3 · answered Nov 13 '20 at 03:07

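You can use a negative lookahead, (?![A-Z]), so that a number immediately followed by an uppercase letter (such as the 1 in 1ST) is not matched: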

stringr::str_extract_all(x, '\\d+(\\.\\d+)?(?![A-Z])')

#[[1]]
#[1] "3.2"    "100000" "1.1"   

#[[2]]
#[1] "3.3"    "100000" "1.2"    "3000"  

#[[3]]
#[1] "4000"

#[[4]]
#[1] "3.3"    "100000" "1.2"   

#[[5]]
#[1] "3.3"    "100000" "1.2"   

#[[6]]
#[1] "3.2"    "100000"

If you want the output as one string per element:

sapply(stringr::str_extract_all(x, '\\d+(\\.\\d+)?(?![A-Z])'), paste, collapse = ' ')
#[1] "3.2 100000 1.1"      "3.3 100000 1.2 3000" "4000"               
#[4] "3.3 100000 1.2"      "3.3 100000 1.2"      "3.2 100000"
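One caveat with (?![A-Z]): when the lookahead fails, the regex engine backtracks, so an input such as 100ST would yield 10, and the 1-ST variant raised in an earlier comment would still yield 1. The following is a hedged sketch of a stricter pattern in base R with perl = TRUE. The sample strings in y and the names num_pattern, res and out are made up for illustration, and the rule "a number directly followed by a letter, or by a hyphen and a letter, is not a pure number" is an assumption about the data:

```r
# Hypothetical inputs covering the awkward cases (1-ST, 100ST)
y <- c("3.2% 1-ST $100000 AND 1.1% BALANCE",
       "100ST AND 2.5% BALANCE",
       "3.2 - $100000")

# (?<![[:alnum:].])  : not preceded by a letter, digit or dot
# \\d+(?:\\.\\d+)?   : an integer or a decimal number
# (?!...)            : not followed by a digit, a dot plus digit, a letter,
#                      or a hyphen plus a letter (so 1ST, 1-ST, 100ST are skipped
#                      and backtracking cannot return part of a number)
num_pattern <- "(?<![[:alnum:].])\\d+(?:\\.\\d+)?(?!\\d|\\.\\d|-?[[:alpha:]])"

# perl = TRUE is needed for lookbehind/lookahead in base R
res <- regmatches(y, gregexpr(num_pattern, y, perl = TRUE))
out <- sapply(res, paste, collapse = " ")
# out holds: "3.2 100000 1.1", "2.5", "3.2 100000"
```

Because the lookbehind rejects a preceding dot, a number glued directly to a sentence-ending period (for example BALANCE.3000) would be skipped; whether that is acceptable depends on the data.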
