I collected a dataset from the web that contains a set of strings following a pattern, for example:

string <- c("<option value=\"AÉCIO NEVES|1117315%23221!MG=PSDB?74646\">AÉCIO NEVES</option>", 
            "<option value=\"KIM KATAGUIRI|1117562%23366!SP=DEM?204536\">KIM KATAGUIRI</option>")

But I only want to extract the numbers that appear between the ? and the closing ">.

In this example, I want to extract 74646 and 204536. Is there a way to collect those numbers automatically and then put them in a new data frame?

jazzurro
John P. S.

1 Answer

There are several ways to extract the numbers. One option is the stringi package with a regular expression that uses a positive lookbehind and a positive lookahead: it matches a run of digits that is preceded by ? and followed by ".

string <- c("<option value=\"AÉCIO NEVES|1117315%23221!MG=PSDB?74646\">AÉCIO NEVES</option>", 
            "<option value=\"KIM KATAGUIRI|1117562%23366!SP=DEM?204536\">KIM KATAGUIRI</option>")


library(stringi)

unlist(stri_extract_all_regex(str = string, pattern = "(?<=\\?)[0-9]+(?=\")"))

#[1] "74646"  "204536"
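
The question also asks for the numbers in a new data frame. As a minimal sketch, here is a base R alternative that swaps the stringi lookarounds for sub() with a capture group. The names `result` and `number` are only illustrative, and it assumes every string contains exactly one ?digits" segment; a string without one is returned unchanged by sub() and becomes NA (with a warning) in as.integer().

```r
string <- c("<option value=\"AÉCIO NEVES|1117315%23221!MG=PSDB?74646\">AÉCIO NEVES</option>",
            "<option value=\"KIM KATAGUIRI|1117562%23366!SP=DEM?204536\">KIM KATAGUIRI</option>")

# Keep only the digits captured between the literal ? and the closing quote.
# Assumption: each string has exactly one such segment.
digits <- sub(".*\\?([0-9]+)\".*", "\\1", string)

# Store the values as integers in a one-column data frame
# (the column name `number` is hypothetical).
result <- data.frame(number = as.integer(digits))
result
```

The stringi result above can be wrapped the same way, for example by passing the unlisted character vector to as.integer() inside data.frame().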
jazzurro