I have a data frame as shown below:
c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY",
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE",
"3.2% 1ST $100000 1.1% BALANCE")
[1] "3.2% 1ST $100000 AND 1.1% BALANCE"
[2] "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY"
[3] "$4000"
[4] "3.3% 1ST $100000 AND 1.2% BALANCE"
[5] "3.3% 1ST $100000 AND 1.2% BALANCE"
[6] "3.2% 1ST $100000 1.1% BALANCE"
Generally, I can say that all comments are in this form:
x.y% 1ST abcdef AND a.b% BALANCE
(it could have a fourth number like a bonus. This data frame has 52000 rows, so definitely it's not possible to capture all sorts of comments but in general, the format I mentioned above is a good starting format).
I am wondering how I can extract each number in these comments and save them in a data frame format. For example, I want to have a data frame like this:
First% cut-off second% Bonus
1 3.2 100000 1.1 NA
2 3.3 100000 1.2 3000
3 NA NA NA NA
4 3.3 100000 1.2 NA
5 3.3 100000 1.2 NA
6 3.2 100000 1.1 NA
I can probably use str_subset
and a pattern like this:
str_subset(data$comment, "(\\d\\.\\d)% 1ST \\$(\\d{3,8}) AND (\\d\\.\\d)% BALANCE \\w+")
but str_subset
returns the string completly and I am not sure how I can save each part individually. I also checked this link Extract a regular expression match, it seems that str_extract
is a good option; however, for only checking one pattern. In my case, if for instance I extract \\d\\.\\d%
, it's not clear if the first number will be extracted or the second one.
Thanks