I have a data frame like the one shown below:
df <- data.frame(col = c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY",
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE",
"3.2% 1ST $100000 1.1% BALANCE","2.1% 1ST $100000 AND 1.2% BALANCE PLUS $2500"))
col
1 3.2% 1ST $100000 AND 1.1% BALANCE
2 3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY
3 $4000
4 3.3% 1ST $100000 AND 1.2% BALANCE
5 3.3% 1ST $100000 AND 1.2% BALANCE
6 3.2% 1ST $100000 1.1% BALANCE
7 2.1% 1ST $100000 AND 1.2% BALANCE PLUS $2500
What I want to do is extract the numbers from these strings and put them in separate columns of a new data frame. As @Ronak Shah recommended here: How to find a pattern in a string and extract it as a new column of data frame
I used this method, which works perfectly:
library(tidyverse)
a <- df %>%
  extract(col, c('First', 'cut-off', 'Second'),
          '(\\d+.*?)% 1ST\\s*\\$(\\d+).*?(\\d+.*?)%.*?', remove = FALSE) %>%
  mutate(Bonus = str_extract(col, '\\d+(?=\\sBONUS)'))
However, I just realized that sometimes the word BONUS is not mentioned in the comments even though the number is actually a bonus. For example, in the string 2.1% 1ST $100000 AND 1.2% BALANCE PLUS $2500
the fourth number is the bonus, but it is not followed by the word "BONUS", so the lookahead in str_extract() does not match and the number is not captured.
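To illustrate, here is a minimal check of the Bonus pattern on rows 2 and 7 of the data above. The vector name x and the variable bonus are only for this illustration; the pattern is the same one used in the mutate() call.

```r
library(stringr)

# Rows 2 and 7 from the example data
x <- c("3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY",
       "2.1% 1ST $100000 AND 1.2% BALANCE PLUS $2500")

# Same pattern as in mutate(): digits that are followed by " BONUS"
bonus <- str_extract(x, "\\d+(?=\\sBONUS)")
bonus
# row 2 gives "3000"; row 7 gives NA because "BONUS" never follows the number
```

So the pattern only works when the literal word BONUS comes right after the amount.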
Is there any way to solve this? Specifically, is there a way to extract the fourth number of the string? In most cases the bonus seems to be the fourth number in my strings.