How to find a pattern in a string and extract it as a new column of data frame

Question

I have a data frame as shown below:

c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY", 
"$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE", 
"3.2% 1ST $100000 1.1% BALANCE")

[1] "3.2% 1ST $100000 AND 1.1% BALANCE"                                
[2] "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY"
[3] "$4000"                                                            
[4] "3.3% 1ST $100000 AND 1.2% BALANCE"                                
[5] "3.3% 1ST $100000 AND 1.2% BALANCE"                                
[6] "3.2% 1ST $100000 1.1% BALANCE"

Generally, I can say that all comments are in this form: x.y% 1ST abcdef AND a.b% BALANCE (it could have a fourth number like a bonus. This data frame has 52000 rows, so definitely it's not possible to capture all sorts of comments but in general, the format I mentioned above is a good starting format). I am wondering how I can extract each number in these comments and save them in a data frame format. For example, I want to have a data frame like this:

    First%   cut-off second%  Bonus 
1    3.2     100000    1.1      NA
2    3.3     100000    1.2      3000
3    NA        NA      NA       NA
4    3.3     100000    1.2      NA
5    3.3     100000    1.2      NA
6    3.2     100000    1.1      NA

I can probably use str_subset and a pattern like this:

str_subset(data$comment, "(\\d\\.\\d)% 1ST \\$(\\d{3,8}) AND (\\d\\.\\d)% BALANCE \\w+")

but str_subset returns the string completly and I am not sure how I can save each part individually. I also checked this link Extract a regular expression match, it seems that str_extract is a good option; however, for only checking one pattern. In my case, if for instance I extract \\d\\.\\d%, it's not clear if the first number will be extracted or the second one.

Thanks

Ronak Shah · Accepted Answer · 2020-11-11T04:07:12.753

4

You can try the following :

library(tidyverse)

df %>%
  extract(col, c('First', 'cut-off', 'Second'), 
               '(\\d+.*?)% 1ST\\s*\\$(\\d+).*?(\\d+.*?)%.*?', remove = FALSE) %>%
  mutate(Bonus = str_extract(col, '\\d+(?=\\sBONUS)')) %>%
  select(-col)

#   First cut-off Second Bonus
#1   3.2  100000    1.1  <NA>
#2   3.3  100000    1.2  3000
#3  <NA>    <NA>   <NA>  <NA>
#4   3.3  100000    1.2  <NA>
#5   3.3  100000    1.2  <NA>
#6   3.2  100000    1.1  <NA>

data

df <- data.frame(col = c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY", 
                         "$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE", 
                         "3.2% 1ST $100000 1.1% BALANCE"))

edited Nov 11 '20 at 04:07

answered Nov 11 '20 at 03:57

Ronak Shah

377,200
20
156
213

Thanks @Ronak Shah. Can you please explain how this line of code is working? So, I am guessing that the way the extract function is working is that it automatically extracts part of the pattern that is in `()`, is it correct? what I am curious about is that how you assign the first found a match to the first column and the secon one to cut-off column? why did you use `str_extract` for the BONUS part? – Ross_you Nov 11 '20 at 04:04
1

Yes, that is correct. So all the values to be extracted are in capture groups `()`. First capture group is assigned to first column i.e `First`, second to `cut-off` and so on. Since BONUS part is not present in all the values I use `str_extract` to get it. – Ronak Shah Nov 11 '20 at 04:08

Karthik S · Answer 2 · 2020-11-11T04:14:21.130

1

Does this work:

st <- c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY", 
        "$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE", 
        "3.2% 1ST $100000 1.1% BALANCE")
dat <- plyr::ldply(str_extract_all(st, '[0-9]\\.[0-9]|[0-9]{4,6}'), rbind)
names(dat) <- c('First%', 'cut-off', 'second%',  'Bonus')
dat <- dat %>% type.convert(as.is = T)
dat$`First%`[dat$`First%`>10] <- NA
dat
  First% cut-off second% Bonus
1    3.2  100000     1.1    NA
2    3.3  100000     1.2  3000
3     NA      NA      NA    NA
4    3.3  100000     1.2    NA
5    3.3  100000     1.2    NA
6    3.2  100000     1.1    NA

edited Nov 11 '20 at 04:14

answered Nov 11 '20 at 04:07

Karthik S

11,348
2
11
25

Thanks @Karthik. For some reason, this didn't work for me. I passed the data frame @Ronak Shah, created to it and it didn't return the expected result. What is `st` in your code? is it a single string? – Ross_you Nov 11 '20 at 04:12
@Roozbeh_you, sorry, I forgot to add the 'st' part, it's just the string in question. Have edited the answer. – Karthik S Nov 11 '20 at 04:14

score 1 · Answer 3 · answered Nov 11 '20 at 04:15

I went with a dplyr pipe. The result isn't perfect, but it's a smoother way to fiddle with the regex matches piece by piece instead of a long string.

library(tidyverse)

data_ <- data.frame(dat = c("3.2% 1ST $100000 AND 1.1% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE AND $3000 BONUS FULL PRICE ONLY", 
            "$4000", "3.3% 1ST $100000 AND 1.2% BALANCE", "3.3% 1ST $100000 AND 1.2% BALANCE", 
            "3.2% 1ST $100000 1.1% BALANCE"))

data_ %>%
  mutate(first_perc = str_extract(dat, "\\d+\\.\\d%?"),
         cut_off = str_extract(dat, "[^\\d+\\.\\d% 1ST.]\\d+"),
         second_perc = str_extract(dat, "[^\\d+\\.\\d% 1ST?\\d+]?\\d+\\.\\d%"),
         bonus = str_extract(dat, "[^\\d+\\.\\d% 1ST?\\d+\\d+\\.\\d%\\D+]?\\d\\d+"))

I might suggest using conditionals to go back for the incomplete rows such as a case_when argument based on what did or did not get caught by the various mutated rows.

Thanks @Kenr Orr. This solution is also valid and more importantly simple. I appreciate your effort — Ross_you, Nov 12 '20 at 18:38

How to find a pattern in a string and extract it as a new column of data frame

3 Answers3

Linked