I'm trying to solve this regex problem:

String:

"notme notme <P abc>getme1</P> notme notme <P dfg>getme2</P> notme notme"

Want this:

"getme1 getme2"

This regex works as long as the <P> tag contains no additional text:

"(?<=<P>)(.|\n)*?(?=</P>)"

If I write it like this, I get an error in the lookbehind:

"(?<=<P.*?>)(.|\n)*?(?=</P>)"

Appreciate every contribution!

Best regards
M

MariusJ
  • Makes me wonder if it would be easier to parse the data with the `rvest` package. – Roman Luštrik Nov 08 '21 at 10:51
  • `regmatches( x, gregexpr("(?<=\\…).+?(?=\\<)", x, perl = TRUE ) )` – Wimpel Nov 08 '21 at 10:53
  • Unfortunately, rvest is not an option because of some namespace error in some tags. – MariusJ Nov 08 '21 at 10:57
  • First of all, do not use patterns like `(.|\n)*?`, EVER; see [this YT video of mine](https://www.youtube.com/watch?v=SEobSs-ZCSE). Next, there are several solutions, but the most straightforward is `stringr::str_match_all(x, "(?s)<P[^>]*>(.*?)</P>")`. Base R: `regmatches(x, gregexpr("(?s)<P[^>]*>\\K.*?(?=</P>)", x, perl=TRUE))`. See [this answer](https://stackoverflow.com/a/39086448/3832970).
    – Wiktor Stribiżew Nov 08 '21 at 10:58
  • Thank you for your reply, Wiktor. It seems like your solution includes the tags (I want them removed). Do you have a solution for that? – MariusJ Nov 08 '21 at 11:03
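
The base R approach suggested in the comments can be sketched as follows. This is a minimal illustration on the sample string from the question, not a general HTML parser; the variable names `x`, `pattern`, `matches`, and `result` are chosen here for the example. The `\K` operator drops the opening tag from the reported match, which sidesteps the fixed-width restriction that makes `(?<=<P.*?>)` fail in a PCRE lookbehind.

```r
# Sample input taken from the question.
x <- "notme notme <P abc>getme1</P> notme notme <P dfg>getme2</P> notme notme"

# (?s)      lets . match newlines, replacing the (.|\n)*? construct.
# <P[^>]*>  consumes the opening tag together with any text inside it.
# \\K       discards everything matched so far from the returned match.
# (?=</P>)  stops before the closing tag without consuming it.
pattern <- "(?s)<P[^>]*>\\K.*?(?=</P>)"

# gregexpr() finds all matches; regmatches() extracts them as a character vector.
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# Join the extracted pieces with a single space.
result <- paste(matches, collapse = " ")
print(result)
```

With the `stringr::str_match_all()` variant, the first column of the returned matrix holds the whole match including the tags, and the captured text without the tags is in the second column.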