I am working in the R programming language and would appreciate some help with formulating regular expressions.

I accept a list of numbers from the user as a string, and I want to extract all the numbers from that string into a numeric vector. I have asked the user to separate the numbers with commas, but I can't expect the user to respect that. So I want to extract the numbers even if they are separated by spaces, semicolons, or something unexpected.

I want to be able to extract all real numbers from the string, even if the numbers are negative (e.g. -5), contain a decimal point (e.g. 5.5), or are in scientific notation (e.g. 5.5e-5, 5.5E-5, 5.5e+5, 5.5E+5, 5.5e5, 5.5E5).

I was reading a forum thread on a similar question and found a regex that extracts numbers from a string, but I realized that it doesn't work for negative numbers, decimals, or scientific notation. I would like to be able to handle all of these.

Using this regular expression I am able to extract real whole numbers from a string separated by spaces or commas or even semi-colons. 

    # Using this string works 
    this_string = "1, 2  3, 5, 7, 10, 11, 12; 18" 
    extracted_numbers = as.numeric(regmatches(this_string, gregexpr("[0-9]+", this_string))[[1]])
    print(extracted_numbers)

Extracted Result: [1] 1 2 3 5 7 10 11 12 18

But the same regular expression does not work on this more complex string with negative numbers, scientific notation, and decimals.

    this_string = "-1, 0, 5e-1 ; 7E-1, 2  3.0, 4, 5.33e+2"

Extracted Result: [1] 1 0 5 1 7 1 2 3 0 4 5 33 2

A correct extraction of numbers from the string should yield:

Desired Extracted Result: [1] -1.0 0.0 0.5 0.7 2.0 3.0 4.0 533.0

Thanks so much for your help.

Edit: I just found a viable solution:

    this_string = "-1, 0, 5e-1 ; 7E-1, 2  3.0, 4, 5.33e+2"
    extracted_numbers = as.numeric(regmatches(this_string, gregexpr("[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?", this_string))[[1]])
    print(extracted_numbers)

User Wojciech Sobala provided the above regular expression in an answer to this question: Extracting decimal numbers from a string

Thanks Wojciech.
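
For readers unpacking the pattern above, here is a minimal sketch that assembles it from its three parts and wraps the extraction in a helper. The helper name `extract_numbers` and the intermediate variable names are illustrative assumptions, not part of the original answer.

```r
# Build the number pattern from its parts (all names here are illustrative).
sign     <- "[-+]?"                 # optional leading sign
mantissa <- "[0-9]*\\.?[0-9]+"      # optional digits, optional decimal point, digits
exponent <- "([eE][-+]?[0-9]+)?"    # optional exponent such as e-5 or E+5
number_pattern <- paste0(sign, mantissa, exponent)

# Hypothetical helper: return every number found in `x` as a numeric vector.
extract_numbers <- function(x) {
  matches <- regmatches(x, gregexpr(number_pattern, x))[[1]]
  as.numeric(matches)
}

this_string <- "-1, 0, 5e-1 ; 7E-1, 2  3.0, 4, 5.33e+2"
extracted <- extract_numbers(this_string)
print(extracted)
```

One caveat of this approach: because the sign is part of the match, a hyphen typed as a separator (for example `1-2`) would be read as `1` and `-2`.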

xyz123
  • Can [this SO post](https://stackoverflow.com/questions/33084563/r-regular-expression-scientific-notation) help? – Rui Barradas Sep 03 '22 at 06:50
  • I am trying the all-numbers regex from that post and it is not extracting anything. I think the post is very helpful, but I am still confused about what is not working: `extracted_numbers = as.numeric(regmatches(this_string, gregexpr("^(-?[0-9]*)((\\.?[0-9]+[eE]?[-\\+]?[0-9]+)|(\\.[0-9]+))*$", this_string))[[1]])` – xyz123 Sep 03 '22 at 06:57
  • If you do want to stick with this quite convoluted pattern, then at least use `str_extract_all` from `stringr`: `lapply(str_extract_all(this_string,"[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"), as.numeric)` – Chris Ruehlemann Sep 04 '22 at 05:38
  • Please see edited answer, have simplified regex – Chris Ruehlemann Sep 04 '22 at 06:11
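
A likely reason the anchored pattern in the comment above extracts nothing is that `^` and `$` force a match to span the entire string, which a multi-number string can never satisfy. The sketch below illustrates the effect with the simpler pattern from the question's edit rather than the exact pattern from the comment; the variable names are illustrative.

```r
this_string <- "-1, 0, 5e-1 ; 7E-1, 2  3.0, 4, 5.33e+2"
number_pattern <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"

# Anchored: the pattern must cover the whole string, so nothing is extracted.
anchored <- paste0("^", number_pattern, "$")
anchored_hits <- regmatches(this_string, gregexpr(anchored, this_string))[[1]]
print(length(anchored_hits))

# The anchored form is still useful for validating one token at a time.
single_ok <- grepl(anchored, "5.5e-5")
print(single_ok)

# Unanchored: every number-shaped substring is matched.
free_hits <- regmatches(this_string, gregexpr(number_pattern, this_string))[[1]]
print(free_hits)
```

So anchors fit validation of a single token, while extraction from a longer string needs the unanchored form.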

1 Answer

Is this what you need?

    library(tidyverse)

    data.frame(this_string) %>%
      mutate(
        # split the string on the separators, then convert each piece to numeric
        this_string = lapply(
          str_split(this_string, ",\\s|\\s;\\s|\\s+"),
          as.numeric
        )
      )
    #                                 this_string
    # 1 -1.0, 0.0, 0.5, 0.7, 2.0, 3.0, 4.0, 533.0

If you prefer to have the results as a numeric vector (returned inside a list, one element per input string):

    lapply(str_split(this_string, ",\\s|\\s;\\s|\\s+"), as.numeric)
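
Since the user may not stick to the exact separators listed in the split pattern, a more permissive variant is to split on any run of commas, semicolons, or whitespace. The following is a minimal base R sketch under that assumption; `tokens`, `values`, and `bad_tokens` are illustrative names, and flagging unparseable tokens is an addition that the answer itself does not cover.

```r
this_string <- "-1, 0, 5e-1 ; 7E-1, 2  3.0, 4, 5.33e+2"

# Split on any run of commas, semicolons, or whitespace.
tokens <- strsplit(this_string, "[,;[:space:]]+")[[1]]
tokens <- tokens[nzchar(tokens)]   # drop the empty token a leading separator creates

# Convert; tokens that are not numbers become NA (the warning is suppressed here).
values <- suppressWarnings(as.numeric(tokens))
bad_tokens <- tokens[is.na(values)]   # would hold anything unparseable, e.g. "abc"

print(values)
print(bad_tokens)
```

Splitting keeps each token whole, so malformed input such as `abc` shows up in `bad_tokens` instead of being silently skipped, which may be preferable when the string comes straight from a user.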

Alternatively, instead of splitting the string by what's between the numbers, you can extract the numbers themselves, using str_extract_all:

    lapply(str_extract_all(this_string, "-?\\d*\\.?\\d+([eE][+-]?\\d+)?"), as.numeric)

EDIT:

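Here's a shorter alternative built on the negated character class `\\S`, which matches any character not in `\\s` (that is, any non-whitespace character). The lookaheads keep commas and semicolons out of the match and let a number end at a separator, at whitespace, or at the end of the string, so numbers followed by a space (such as `5e-1` and `2` in the data below) are picked up as well:

    lapply(str_extract_all(this_string, "(?![,;])\\S+?(?=[,;\\s]|$)"), as.numeric)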

Data:

    this_string = "-1, 0, 5e-1 ; 7E-1, 2  3.0, 4, 5.33e+2"

Chris Ruehlemann