Which regular expression to use to extract some words from an HTML text?

Question

I am having a hard time building a regular expression to grab some words from an HTML text.

Let's say I have the following:

SOME_TEXT_I_WANTSOME_OTHER_TEXT

*SOME_TEXT_I_WANT* and *SOME_OTHER_TEXT* can be either a bunch of words like "SOME RANDOM TEXT" or HTML text like "SOME BOLD TEXT"

My goal is to extract those texts with one regex.

What if the second part is `
some text
and some other text
and yet more text`? Regexps and HTML are always a brittle combination. — Piskvor left the building, Dec 07 '10 at 13:43
Before the haters start up, there is a movement concerning html and RE. RE CAN parse simple html to a degree and can do it well. However, like Piskvor says (and well might I add), "it is brittle"; doable but be careful of your source. — Keng, Dec 07 '10 at 13:57

jensgram · Accepted Answer · 2010-12-07T13:43:32.010

4

Which language do you intend to use? Does an HTML parser exist for this language? If yes, consider using a parser.

However, if this is a "one-off", you may be able to get by with something along the lines of:

#<p[^>]*>(.*?)</p>#

The above has certain limitations: most notably, it does not correctly match an opening tag with a > inside an attribute value (for example, one ending in b">...), nor nested <p>s. (I am not able to tell whether the mark-up you're trying to parse actually allows nested <p>s; I'm just informing you of possible pitfalls.)
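To make the pattern and its pitfalls concrete, here is a minimal sketch in Python. The language is an assumption, since the thread never settles on one, and the sample inputs are hypothetical. The pattern is the one above without the # delimiters.

```python
import re

# Same pattern as above, minus the '#' delimiters.
P_PATTERN = re.compile(r"<p[^>]*>(.*?)</p>")

# Hypothetical input: two paragraphs, the second containing inline markup.
html = '<p>SOME RANDOM TEXT</p><p class="x">SOME <b>BOLD</b> TEXT</p>'
texts = P_PATTERN.findall(html)
print(texts)  # ['SOME RANDOM TEXT', 'SOME <b>BOLD</b> TEXT']

# Pitfall 1: a '>' inside an attribute value ends [^>]* too early,
# so the rest of the attribute leaks into the captured text.
tricky = '<p title="a > b">text</p>'
leaked = P_PATTERN.findall(tricky)
print(leaked)  # [' b">text']

# Pitfall 2: with nested <p>s the lazy group stops at the first </p>.
nested = "<p>outer <p>inner</p> tail</p>"
cut_short = P_PATTERN.findall(nested)
print(cut_short)  # ['outer <p>inner']
```

Note that `.` does not match newlines by default in Python, so a paragraph spanning several lines would need the `re.DOTALL` flag.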

edited Dec 07 '10 at 13:43

answered Dec 07 '10 at 13:37

jensgram


Indeed. This can work for **very simple** HTML-like strings; using regexps for extracting data from HTML is a nightmare waiting to happen. – Piskvor left the building Dec 07 '10 at 13:41
@Piskvor Exactly. One should always be very certain that the input is actually suited for expression-based matching before jumping on the RegExp band-wagon. – jensgram Dec 07 '10 at 13:45
This is working great! It was quite obvious but I was looking for something difficult... And don't worry, this is not intended to parse large web pages but only some text from a personal web app. – Anth0 Dec 07 '10 at 15:00
@Anth0 In that case you should be fine :) – jensgram Dec 07 '10 at 16:21

score 0 · Answer 2 · answered Dec 07 '10 at 13:38


Assuming you are using PHP:

$html = "<p>some text here</p>";
// Remove every tag, leaving only the text between them.
$text = preg_replace("/<.+?>/", "", $html);
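This approach strips tags rather than capturing text, so adjacent elements run together in the result. Here is a small sketch of the same substitution in Python (an assumed language for illustration, with a hypothetical two-paragraph input):

```python
import re

# Hypothetical input with two adjacent paragraphs.
html = "<p>some text here</p><p>more text</p>"

# Same pattern as the PHP call: lazily match anything between '<' and '>'.
stripped = re.sub(r"<.+?>", "", html)
print(stripped)  # some text heremore text
```

If the two texts need to stay separate, a capturing pattern or a parser is likely a better fit than plain tag removal.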

answered Dec 07 '10 at 13:38

Vlad.P


score 0 · Answer 3 · edited May 23 '17 at 11:47


Don't use regex. If you ask why, there is a very popular SO post that describes what can happen if you try to use regex for parsing HTML.

Use your language's HTML or XML parser and extract what you need using existing functionality.
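As a minimal sketch of the parser route, assuming Python and its standard-library `html.parser` module (the thread does not name a language): the class `ParagraphText`, the helper `extract_paragraphs`, and the sample input are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text content of each <p> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph texts
        self._in_p = False    # True while inside a <p> element
        self._chunks = []     # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._chunks))
            self._in_p = False

    def handle_data(self, data):
        # Text inside inline tags such as <b> is still reported here.
        if self._in_p:
            self._chunks.append(data)


def extract_paragraphs(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


html = '<p title="a > b">SOME RANDOM TEXT</p><p>SOME <b>BOLD</b> TEXT</p>'
print(extract_paragraphs(html))  # ['SOME RANDOM TEXT', 'SOME BOLD TEXT']
```

Unlike the regex approaches, this sketch returns the text without the inline markup and tolerates a > inside a quoted attribute value.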

edited May 23 '17 at 11:47

Community


answered Dec 07 '10 at 13:52

darioo

