Is there any way to extract all numbers in a string as a vector? I have a large dataset that doesn't follow any specific pattern, so using extract with a single regex pattern won't necessarily capture all the numbers. For example, for each row of the data frame shown below:

x <- c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY", 
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE", 
"3.2 - $100000")

[1] "3.2% 1ST $100000 AND 1.1% BALANCE"                                
[2] "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY"
[3] "$4000"                                                            
[4] "3.3% 1ST $100000 AND 1.2% BALANCE"                                
[5] "3.3% 1ST $100000 AND 1.2% BALANCE"                                
[6] "3.2 - $100000"   

I want to have an output like:

[1] "3.2 100000 1.1"                                
[2] "3.3 100000 1.2 3000"
[3] "4000"                                                            
[4] "3.3 100000 1.2 "                                
[5] "3.3 100000 1.2 "                                
[6] "3.2 100000 "   

I had a look at some resources and found this link: https://statisticsglobe.com/extract-numbers-from-character-string-vector-in-r

regmatches(x, gregexpr("[[:digit:]]+", x))

The function above seems to work, but it cannot handle all kinds of numbers at once. I understand that "[[:digit:]]+" only matches integers, but how can we change it so that it covers all kinds of numbers?

Ross_you
  • See [Regular expression for floating point numbers](https://stackoverflow.com/questions/12643009/regular-expression-for-floating-point-numbers) – Wiktor Stribiżew Nov 12 '20 at 22:23

3 Answers

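We also need to include the `.` in the matching pattern so that decimal numbers are captured. The `\\b` word boundaries keep the digit in tokens such as 1ST from being matched: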

sapply(regmatches(x, gregexpr("\\b[[:digit:].]+\\b", x)), paste, collapse= ' ')
#[1] "3.2 100000 1.1"    
#[2] "3.3 100000 1.2 3000" 
#[3] "4000"              
#[4] "3.3 100000 1.2"   
#[5] "3.3 100000 1.2"     
#[6] "3.2 100000"   
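
Since the question asks for the numbers "as a vector", the matches can also be kept as numeric vectors instead of being pasted into one string. The following is only a minimal sketch of that idea, using the same pattern on a shortened copy of the example data; the names `matches` and `nums` are illustrative.

```r
# Shortened copy of the example data from the question
x <- c("3.2% 1ST $100000 AND 1.1% BALANCE",
       "$4000",
       "3.2 - $100000")

# Same pattern as above: runs of digits and dots between word boundaries
matches <- regmatches(x, gregexpr("\\b[[:digit:].]+\\b", x))

# Convert each character vector of matches to a numeric vector;
# the result is a list with one numeric vector per input string
nums <- lapply(matches, as.numeric)
```

A list is used because the strings contain different numbers of values, so the results would not fit a regular matrix without padding.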
akrun

akrun's answer is perfect, but here is another solution that uses rebus, a package for building regular expression patterns that I recently found.

library(stringr)
library(rebus)
library(magrittr)

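# One or more digits, optionally followed by a dot and more digits
pattern <- one_or_more(DIGIT) %R% optional(DOT) %R% optional(one_or_more(DIGIT))

# Drop the first "1ST" in each string so its digit is not extracted,
# then collect all matches and paste them into one string per element
str_remove(x, "1ST") %>%
  str_match_all(pattern = pattern) %>%
  lapply(function(x) paste(as.vector(x), collapse = " ")) %>%
  unlist()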

Johan Rosa
  • Thanks Johan. Your answer is also correct; however, I should add that not all strings contain **1ST**; some of them contain, for instance, **1-ST**. I guess this can cause issues when using `str_remove`, right? – Ross_you Nov 12 '20 at 23:54
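
One possible way to address the concern in the comment above is to strip ordinal tokens with a slightly more general pattern before extracting the numbers. This is only a sketch under assumptions: the suffix list (ST, ND, RD, TH) and the optional hyphen are guesses about what the real data contains, and `ordinal` and `cleaned` are illustrative names.

```r
library(stringr)

# Two example strings, one with "1ST" and one with "1-ST"
x <- c("3.2% 1ST $100000 AND 1.1% BALANCE",
       "3.3% 1-ST $100000 AND 1.2% BALANCE")

# Assumed ordinal forms: digits, an optional hyphen, then ST/ND/RD/TH
ordinal <- "\\b\\d+-?(ST|ND|RD|TH)\\b"

# Remove every ordinal token, then extract integers and decimals
cleaned <- str_remove_all(x, ordinal)
result <- sapply(str_extract_all(cleaned, "\\d+(?:\\.\\d+)?"),
                 paste, collapse = " ")
```

Unlike `str_remove`, `str_remove_all` removes every occurrence in a string, not just the first one.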

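You can use a regex with a negative lookahead, which skips any number that is immediately followed by an uppercase letter (such as the 1 in 1ST):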

stringr::str_extract_all(x, '\\d+(\\.\\d+)?(?![A-Z])')

#[[1]]
#[1] "3.2"    "100000" "1.1"   

#[[2]]
#[1] "3.3"    "100000" "1.2"    "3000"  

#[[3]]
#[1] "4000"

#[[4]]
#[1] "3.3"    "100000" "1.2"   

#[[5]]
#[1] "3.3"    "100000" "1.2"   

#[[6]]
#[1] "3.2"    "100000"

If you want the output as one string per element:

sapply(stringr::str_extract_all(x, '\\d+(\\.\\d+)?(?![A-Z])'), paste, collapse = ' ')
#[1] "3.2 100000 1.1"      "3.3 100000 1.2 3000" "4000"               
#[4] "3.3 100000 1.2"      "3.3 100000 1.2"      "3.2 100000"  
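
The same lookahead pattern can also be tried without stringr. The sketch below assumes base R only; `perl = TRUE` is needed because the default regex engine used by `gregexpr` does not support lookaheads. The data is a shortened copy of the example from the question.

```r
# Shortened copy of the example data from the question
x <- c("3.2% 1ST $100000 AND 1.1% BALANCE",
       "$4000",
       "3.2 - $100000")

# Digits with an optional decimal part, not followed by an uppercase letter
pattern <- "\\d+(\\.\\d+)?(?![A-Z])"

# perl = TRUE switches to the PCRE engine, which supports (?!...)
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# One space-separated string per input element
result <- sapply(matches, paste, collapse = " ")
```

Note that the lookahead only inspects the character immediately after the match, so a token such as 1-ST would still yield 1, because the digit is followed by a hyphen rather than a letter.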
Ronak Shah