
I'm trying to understand regular expressions:

I need to match only text_01 and text_02 and filter out the tags in this input:

<span>text_01<b>text_02</b>

I tried this pattern:

(?<=<span>)(([^>]+)<b>)(.+?)(?=</b>)

But it captures 3 groups, and the full match includes a tag:

text_01<b>text_02

Could you give me advice on how I need to build a regex whose Full match contains only text and no tags?

Luciano van der Veekens
recont

2 Answers


A non-capturing group keeps the middle <b> tag out of the capture groups, but the full match will always include that tag. A single match is always one contiguous piece of the input, so a regular expression cannot skip part of the text in the middle of a match. Read the two text pieces from Group 1 and Group 2 instead of from the full match:

(?<=<span>)(.+?)(?:<b>)(.+?)(?=<\/b>)
  • Full match text_01<b>text_02
  • Group 1. text_01
  • Group 2. text_02
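
As a minimal sketch of reading the groups rather than the full match, here is the same pattern in Python. The question does not name a language, so Python is an assumption, and `PATTERN`, `extract_texts` and `sample` are illustrative names that do not come from the original post.

```python
import re

# Pattern from the answer above. The slash in </b> is left unescaped here
# because Python's re module does not require escaping it.
PATTERN = re.compile(r"(?<=<span>)(.+?)(?:<b>)(.+?)(?=</b>)")


def extract_texts(html):
    """Return (group 1, group 2) as a tuple, or None when nothing matches."""
    m = PATTERN.search(html)
    return m.groups() if m else None


sample = "<span>text_01<b>text_02</b>"
match = PATTERN.search(sample)

print(match.group(0))         # full match, still contains the <b> tag
print(extract_texts(sample))  # only the two captured text pieces
```

The full match still spans the tag, so the tag-free text has to be assembled from the groups, for example with `"".join(extract_texts(sample))`.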
Luciano van der Veekens

Parsing HTML with regular expressions can get very complicated. It is generally not recommended; it is better to use an HTML parser (a library for whatever language you are using).
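
To illustrate the parser route, here is a small sketch using `html.parser` from the Python standard library. Python is an assumption, since the question does not say which language is in use, and `TextCollector` and `text_chunks` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every run of text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for text content, never for the tags themselves.
        self.chunks.append(data)


def text_chunks(html):
    """Return the text runs of an HTML fragment, in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.chunks


print(text_chunks("<span>text_01<b>text_02</b>"))
```

The parser tolerates the unclosed <span> in the sample, which is one reason this approach tends to be more robust than a hand-written pattern.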

But if you are sure that the text content contains neither < nor >, and that the < and > of the tags are never nested, you could use this one:

[^<>]*(?=<[^<>]*>)

This matches only text that is immediately followed by a tag, that is, by a < ... > pair.

If it is enough to test that the text is followed by <, the pattern can be simplified to:

[^<>]*(?=<)
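
A short sketch of how these patterns behave when collecting all matches, assuming Python's `re` module (the question does not name a language). Because of the `*` quantifier, both patterns can also match the empty string right in front of a tag, so the sketch filters empty results out. The `+` variant is a swapped-in alternative that avoids empty matches; it is not part of the original answer. All variable names are illustrative.

```python
import re

sample = "<span>text_01<b>text_02</b>"

# The two patterns from the answer above, unchanged.
FOLLOWED_BY_TAG = re.compile(r"[^<>]*(?=<[^<>]*>)")
FOLLOWED_BY_LT = re.compile(r"[^<>]*(?=<)")

# Assumption, not from the original answer: '+' instead of '*' requires at
# least one character, so no empty strings are matched.
NON_EMPTY = re.compile(r"[^<>]+(?=<[^<>]*>)")

# Drop the empty strings that the '*' versions can match in front of a tag.
tag_matches = [m for m in FOLLOWED_BY_TAG.findall(sample) if m]
lt_matches = [m for m in FOLLOWED_BY_LT.findall(sample) if m]
plus_matches = NON_EMPTY.findall(sample)

print(tag_matches)
print(lt_matches)
print(plus_matches)
```

Each match here is a separate tag-free piece of text, which gets around the limitation that a single match cannot skip over a tag.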

trincot