Extract text from HTML markup

Question

I would like to extract static text from between HTML tags:

<p>
text here
<span> text here <b>too</b></span>
</p>

I have this regular expression so far:

(&lt;|<)[\s\/\?]*(\w+)(?<attributes>.*?)[\s\/\?]*(&gt;|>)(\n|.)*?<\/\2>

I don't want to use an HTML parser. Any help would be appreciated. Thanks!

possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Kerrek SB, Feb 04 '12 at 00:00
I saw that post, but I am not looking to parse the whole HTML document. I just need to extract static text wherever possible. The file types I am using contain other symbols that violate XML rules, so it is not possible to convert them to XML easily. — , Feb 04 '12 at 02:18

score 0 · Answer 1 · edited May 23 '17 at 11:50

Using regex to parse HTML is a Bad Idea (tm).

Look here, here, and here for more/better words of wisdom on the subject.

answered Feb 04 '12 at 00:00

Muad'Dib


I am using JavaScript, maybe I can use iterations on the match results to find inner tags?! – Feb 04 '12 at 02:20

score 0 · Accepted Answer · answered Feb 04 '12 at 02:12

Parsing HTML with regexes is usually a bad idea, but that's not exactly what you're trying to do here. All you really want is to strip out the HTML tags. In your example, you try to match the tags and parse out the attributes. But you don't need to do this.

If the following assumptions hold:

You don't need to get rid of HTML entities
Your tags don't define any whitespace (i.e. you don't care that <p> delimits paragraphs)
You don't have any comments or doctypes

Then all you need to do is to strip the pattern </?[^>]+>.

Escaped, in vim, this is:

s/<\/\?[^>]\+>//g
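
Since the asker mentions using JavaScript, the same idea can be sketched there. This is only a minimal illustration under the three assumptions listed above (no entities to decode, no whitespace implied by tags, no comments or doctypes). The function name stripTags and the sample string are made up for this example and do not come from the original posts.

```javascript
// Hypothetical helper (name chosen for illustration only).
// Removes anything that looks like an opening or closing tag,
// using the pattern </?[^>]+> from the answer above.
function stripTags(markup) {
  return markup.replace(/<\/?[^>]+>/g, "");
}

// Sample input modelled on the markup in the question.
const sample = "<p>\ntext here\n<span> text here <b>too</b></span>\n</p>";

// Prints the text with the tags removed; line breaks and spaces
// that sat between the tags are left untouched.
console.log(stripTags(sample));
```

Note that this leaves whitespace exactly as it was in the source, and a literal > inside an attribute value would end the match early, so it is best treated as a quick filter rather than a general solution.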
