How to extract numbers included in a string in order using regex

Question

I have a data frame like the one shown below:

df <- data.frame(col = c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY", 
                         "$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE", 
                         "3.2% 1ST $100000 1.1% BALANCE","2.1% 1ST $100000 AND 1.2% BALANCE PLUS $2500"))

                                                                col
1                                 3.2% 1ST $100000 AND 1.1% BALANCE
2 3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY
3                                                             $4000
4                                 3.3% 1ST $100000 AND 1.2% BALANCE
5                                 3.3% 1ST $100000 AND 1.2% BALANCE
6                                     3.2% 1ST $100000 1.1% BALANCE
7                      2.1% 1ST $100000 AND 1.2% BALANCE PLUS $2500

What I want to do is extract the numbers from these strings and put them in separate columns of a new data frame. As @Ronak Shah recommended here: How to find a pattern in a string and extract it as a new column of data frame

I used this method, which works perfectly:

library(tidyverse)

a <- df %>%
  extract(col, c('First', 'cut-off', 'Second'),
          '(\\d+.*?)% 1ST\\s*\\$(\\d+).*?(\\d+.*?)%.*?', remove = FALSE) %>%
  mutate(Bonus = str_extract(col, '\\d+(?=\\sBONUS)'))

However, I just realized that the word BONUS is sometimes missing from the comment even though the number is actually a bonus. For example, in the string 2.1% 1ST $100000 AND 1.2% BALANCE PLUS $2500 the fourth number is the bonus, but it is not followed by the word "BONUS", so it is not captured. Is there any way to solve this? Is there a way to extract the fourth number of the string? In most cases the bonus seems to be the fourth number in my strings.
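Read literally, "the fourth number of the string" can be sketched in base R by collecting every standalone number in order and indexing the result. This is only an illustration under assumptions: the sample vector `x` and the helper pattern `num_pattern` are made up here, and the word boundaries are used so that the digit in "1ST" is not counted as a number.

```r
# Sample strings taken from the data frame in the question.
x <- c("2.1% 1ST $100000 AND 1.2% BALANCE PLUS $2500",
       "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY",
       "3.2% 1ST $100000 AND 1.1% BALANCE",
       "$4000")

# \\b...\\b keeps standalone numbers such as 2.1 or 100000 and skips the
# digit glued to letters in "1ST".
num_pattern <- "\\b\\d+(?:\\.\\d+)?\\b"

# One character vector of numbers per input string, in order of appearance.
nums <- regmatches(x, gregexpr(num_pattern, x, perl = TRUE))

# Indexing past the end of a character vector yields NA, so strings with
# fewer than four numbers get NA.
fourth <- vapply(nums, function(v) v[4], character(1))
fourth
# "2500" "3000" NA NA
```

The positional approach only works while the bonus really is the fourth number, which the question says holds in most cases but not necessarily all.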

score 2 · Accepted Answer · answered Nov 12 '20 at 19:37

You can use

^(\d[\d.]*)%\s*1ST\s*\$(\d+)\D*(\d[\d.]*)%\D*(\d*)

See the regex demo.

In R, use

a <- df %>%
  extract(col, c('First', 'cut-off', 'Second', 'Bonus'), 
    '^(\\d[\\d.]*)%\\s*1ST\\s*\\$(\\d+)\\D*(\\d[\\d.]*)%\\D*(\\d*)', remove = FALSE)

Details

^ - start of string
(\d[\d.]*) - Group 1: a digit and then zero or more digits/dots
% - a % char
\s* - 0+ whitespaces
1ST - the literal text 1ST
\s* - 0+ whitespaces
\$ - a $ char
(\d+) - Group 2: one or more digits
\D* - 0+ non-digits
(\d[\d.]*) - Group 3: a digit and then zero or more digits/dots
%\D* - % and 0+ non-digits
(\d*) - Group 4: zero or more digits (it matches an empty string when no number follows).
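To see what each group captures, the same pattern can be tried outside of `extract()` with base R's `regexec()` and `regmatches()`. This is a sketch, not part of the original answer; the vector `x` reuses three strings from the question, and `groups` is a name chosen here for illustration.

```r
x <- c("2.1% 1ST $100000 AND 1.2% BALANCE PLUS $2500",  # trailing number present
       "3.2% 1ST $100000 AND 1.1% BALANCE",             # no trailing number
       "$4000")                                         # pattern does not match

pattern <- "^(\\d[\\d.]*)%\\s*1ST\\s*\\$(\\d+)\\D*(\\d[\\d.]*)%\\D*(\\d*)"

# Each list element holds the full match followed by the four groups,
# or character(0) when the string does not match at all.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Drop the full match and keep only the captured groups.
groups <- lapply(m, function(v) v[-1])
groups[[1]]  # "2.1" "100000" "1.2" "2500"
groups[[2]]  # "3.2" "100000" "1.1" ""
groups[[3]]  # character(0)
```

The second string shows Group 4 matching an empty string, and the third shows a string the pattern does not match at all.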

Wiktor Stribiżew

Stribizew, thanks, I got my answer. Just one question: in the Bonus column, when no match is found, it shows an empty cell instead of NA. Is there any reason for that? It is shown as "" when I print the data frame – Ross_you Nov 12 '20 at 19:47
Actually, on a closer look, in some cases it returns NA, in some cases a number, and in some cases "" (empty), and I am not sure why – Ross_you Nov 12 '20 at 19:50
@Roozbeh_you I thought it would be a more elegant solution. If you need `NA` values, use a slightly different regex, `'^(\\d[\\d.]*)%\\s*1ST\\s*\\$(\\d+)\\D*(\\d[\\d.]*)%(?:\\D*(\\d+))?'`. Here, Group 4 requires at least one digit to match, so if there is no match, the group will not participate in the match, and its value will be NA. – Wiktor Stribiżew Nov 12 '20 at 19:50
It is still returning an empty cell. Here are a few cells as an example from when I printed the column `[53] "" "" "" "" "" "" "" "" "" "" "" "" "" [66] "" "" "" "" "" "" "" "3000" "" "" "" "" "" [79] "" "" NA "" "" "" "" "" "" "" "" "" ""` – Ross_you Nov 12 '20 at 19:54
I guess that for strings that do not match the pattern at all, it returns NA. Is that correct? For example, for the strings `3.3% - 1ST $100000 AND 1.2% BALANCE` or `1.5% SALE PRICE`, NA was returned for all columns – Ross_you Nov 12 '20 at 19:58
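If the empty strings are unwanted, one option is to turn them into NA after extraction rather than through the regex. The snippet below is a minimal base R sketch; the `bonus` vector is a hypothetical stand-in for the extracted Bonus column, mixing the three kinds of values reported above.

```r
# Hypothetical extracted column: "" (pattern matched, no bonus),
# a number, and NA (pattern did not match at all).
bonus <- c("", "3000", NA, "2500")

# Replace empty strings with NA; the is.na() guard keeps the
# comparison from producing NA in the logical index.
bonus[!is.na(bonus) & bonus == ""] <- NA_character_

# Convert to numeric once only digits and NA remain.
bonus_num <- as.numeric(bonus)
bonus_num
# NA 3000 NA 2500
```

Inside a tidyverse pipeline, `mutate(Bonus = na_if(Bonus, ""))` from dplyr should have the same effect.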
@Roozbeh_you Looks like that. But that means you may use `'^(\\d[\\d.]*)%\\D*1ST\\s*\\$(\\d+)\\D*(\\d[\\d.]*)%\\D*(\\d*)'` (or with `(?:\\D*(\\d+))?` at the end), see [this regex demo](https://regex101.com/r/nAAItn/2). – Wiktor Stribiżew Nov 12 '20 at 19:59
Wow, that's an interesting website, thanks for sharing it. I can play with the pattern to find the best one – Ross_you Nov 12 '20 at 20:04
In the solution you provided, how can I make **1ST** optional without capturing it, so that the pattern also matches `2.1% $100000 AND 1.2% BALANCE`? I wanted to try `(1ST?)`, but I am afraid this will add another capturing group and cause issues – Ross_you Nov 12 '20 at 23:58
@Roozbeh_you Use `'^(\\d[\\d.]*)%(?:\\D*1ST)?\\D*\\$(\\d+)\\D*(\\d[\\d.]*)%\\D*(\\d*)'`, see [regex demo](https://regex101.com/r/nAAItn/4). See [How do I make part of a regex match optional?](https://stackoverflow.com/questions/12451731/how-do-i-make-part-of-a-regex-match-optional) – Wiktor Stribiżew Nov 13 '20 at 01:08
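The effect of the non-capturing group `(?:\\D*1ST)?` can be checked in base R. This is a sketch with made-up variable names (`pattern_opt`, `first`, `cutoff`), using strings quoted in the comments and the question; it shows that the number of captured groups stays at four whether or not "1ST" is present.

```r
pattern_opt <- "^(\\d[\\d.]*)%(?:\\D*1ST)?\\D*\\$(\\d+)\\D*(\\d[\\d.]*)%\\D*(\\d*)"

x <- c("2.1% $100000 AND 1.2% BALANCE",                 # no 1ST
       "3.3% - 1ST $100000 AND 1.2% BALANCE",           # extra characters before 1ST
       "2.1% 1ST $100000 AND 1.2% BALANCE PLUS $2500")  # original format

m <- regmatches(x, regexec(pattern_opt, x, perl = TRUE))

# Full match plus four groups for every string: (?:...) adds no group.
lengths(m)
# 5 5 5

# Element 1 is the full match, so the groups start at index 2.
first  <- vapply(m, function(v) v[2], character(1))
cutoff <- vapply(m, function(v) v[3], character(1))
first   # "2.1" "3.3" "2.1"
cutoff  # "100000" "100000" "100000"
```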
