I would like to extract static text from between HTML tags:

<p>
text here
<span> text here <b>too</b></span>
</p>

I have this regular expression so far:

(&lt;|<)[\s\/\?]*(\w+)(?<attributes>.*?)[\s\/\?]*(&gt;|>)(\n|.)*?<\/\2>

I don't want to use an HTML parser. Any help would be appreciated. Thanks!

  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Kerrek SB Feb 04 '12 at 00:00
  • Why don't you want to use an HTML parser? – CanSpice Feb 04 '12 at 00:03
  • I saw that post, but I am not looking to parse the whole HTML document. I just need to extract static text wherever possible. The file types I am using contain other symbols that violate XML rules, so it is not possible to convert them to XML easily. –  Feb 04 '12 at 02:18

2 Answers

Using regex to parse HTML is a Bad Idea (tm).

Look here, here, and here for more/better words of wisdom on the subject.

Muad'Dib
  • I am using JavaScript, maybe I can use iterations on the match results to find inner tags?! –  Feb 04 '12 at 02:20

Parsing HTML with regexes is usually a bad idea, but that's not exactly what you're trying to do here. All you really want is to strip out the HTML tags. In your example, you try to match the tags and parse out the attributes. But you don't need to do this.

If the following assumptions hold:

  • You don't need to get rid of HTML entities
  • Your tags don't define any whitespace (i.e. you don't care that <p> delimits paragraphs)
  • You don't have any comments or doctypes

Then all you need to do is to strip the pattern </?[^>]+>.

Escaped, in vim, this is (prefix it with % to apply it to every line of the buffer rather than just the current one):

s/<\/\?[^>]\+>//g
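
Since the question mentions JavaScript, the same substitution might look roughly like the sketch below. The function name `stripTags` and the `sample` string are made up for illustration, and it only holds under the assumptions listed above (no entities to decode, no comments or doctypes, no `>` inside attribute values).

```javascript
// Minimal sketch: remove anything that looks like an opening or closing tag.
// Assumes the input contains no comments, doctypes, or ">" inside attributes.
function stripTags(html) {
  // <      literal opening angle bracket
  // \/?    optional slash (closing tags)
  // [^>]+  one or more characters that are not ">" (this also spans newlines)
  // >      literal closing angle bracket
  return html.replace(/<\/?[^>]+>/g, "");
}

// Hypothetical sample, modelled on the markup in the question.
const sample = "<p>\ntext here\n<span> text here <b>too</b></span>\n</p>";

// JSON.stringify makes the remaining whitespace visible.
console.log(JSON.stringify(stripTags(sample)));
// prints "\ntext here\n text here too\n"
```

Whitespace that sat between tags is left untouched, so you may want to call `.trim()` or collapse runs of whitespace afterwards, depending on what you need the text for.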
beerbajay