1

I have a small problem. I'm trying to get the text that lies outside of HTML elements. Example input:

I want this text I want this text I want this text <I don't want this text/>
I want this text I wan this text <I don't>want this</text>

Does anybody know how to do this with a regex? I thought I could do it by deleting the elements together with their text. Or does anybody know another solution to this problem? Please help me.

user35443

3 Answers

3

Instead of regex, which is not suitable for parsing HTML in general (especially malformed HTML), use an HTML parser like the HTML Agility Pack.

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
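As a rough illustration of the parser route, the sketch below collects only the text nodes that sit directly under the document root, so text nested inside an element is skipped. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (`TopLevelTextDemo`, `TopLevelText`) are made up for this example. The sample string uses well-formed markup, because how the parser treats pseudo-tags such as `<I don't want this text/>` would need to be checked separately.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class TopLevelTextDemo
{
    // Hypothetical helper: returns the text nodes that are direct children
    // of the document root, i.e. text that is not inside any element.
    public static string[] TopLevelText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.ChildNodes
                  .Where(node => node.NodeType == HtmlNodeType.Text)
                  .Select(node => node.InnerText)
                  .ToArray();
    }

    public static void Main()
    {
        // Illustrative input, not the exact sample from the question.
        string html = "keep this <b>drop this</b> keep this too";

        foreach (string text in TopLevelText(html))
        {
            Console.WriteLine("[" + text + "]");
        }
    }
}
```

If text nested deeper in the tree is also wanted, the same filter could be applied to `doc.DocumentNode.Descendants()` instead of `ChildNodes`.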

Oded
1

Try this

(?<!<.*?)([^<>]+)

Explanation

@"
(?<!        # Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
   <           # Match the character “<” literally
   .           # Match any single character that is not a line break character
      *?          # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
)
(           # Match the regular expression below and capture its match into backreference number 1
   [^<>]       # Match a single character NOT present in the list “<>”
      +           # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"
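To see what this pattern returns on the sample from the question, here is a minimal sketch. The class and method names (`OutsideTextDemo`, `ExtractOutsideText`) are invented for illustration; only the pattern itself comes from the answer. It relies on .NET accepting a variable-length lookbehind, which many other regex engines reject.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class OutsideTextDemo
{
    // Pattern from the answer: a run of non-bracket characters that has
    // no '<' earlier on the same line ('.' does not cross line breaks).
    private const string Pattern = @"(?<!<.*?)([^<>]+)";

    // Hypothetical helper: returns the contents of capture group 1
    // for every match found in the input.
    public static string[] ExtractOutsideText(string input)
    {
        return Regex.Matches(input, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    public static void Main()
    {
        string input =
            "I want this text I want this text I want this text <I don't want this text/>\n" +
            "I want this text I wan this text <I don't>want this</text>";

        // Brackets make the trailing space of each match visible.
        foreach (string text in ExtractOutsideText(input))
        {
            Console.WriteLine("[" + text + "]");
        }
    }
}
```

On the sample this yields the leading text of each line. Because the lookbehind rejects every position that has a `<` earlier on the same line, text that follows a tag on that line is not returned: for `a <b>x</b> c` only `a ` is matched.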
Cylian
1

I agree that anything non-trivial should be done with an HTML parser (the Agility Pack is excellent if you use .NET), but for a small requirement like this it's more than likely overkill. Then again, an HTML parser knows more about the quirks and edge cases that HTML is full of. Be sure to test well before using a regex.

Here you go

<.*?>.*?<.*?>|<.*?/>

It also correctly removes the whole of

<I don't>want this</text>

including the text between the tags, not just the tags themselves.

In C# this becomes

// Requires: using System.Text.RegularExpressions;
string resultString = Regex.Replace(subjectString, @"<.*?>.*?<.*?>|<.*?/>", "");
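In line with the advice to test well, here is a small self-contained sketch that runs the replacement on the sample from the question and on one input where it removes too much. The names `StripTagsDemo` and `StripElements` are made up for illustration, and the second input is my own, not part of the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripTagsDemo
{
    // Pattern from the answer: an open/close tag pair with everything
    // between them on the same line, or a single self-closing tag.
    private const string Pattern = @"<.*?>.*?<.*?>|<.*?/>";

    // Hypothetical helper: deletes every match of the pattern.
    public static string StripElements(string input)
    {
        return Regex.Replace(input, Pattern, "");
    }

    public static void Main()
    {
        string input =
            "I want this text I want this text I want this text <I don't want this text/>\n" +
            "I want this text I wan this text <I don't>want this</text>";

        // Both pseudo-elements are removed; the surrounding text stays.
        Console.WriteLine(StripElements(input));

        // Pitfall: the first alternative treats two self-closing tags on one
        // line as an open/close pair, so " keep " is removed as well.
        Console.WriteLine("[" + StripElements("a <x/> keep <y/> b") + "]");
    }
}
```

The second call shows why the order of the alternatives matters: `<.*?>.*?<.*?>` is tried first and happily spans from `<x/>` to `<y/>`. Inputs like this are where a real parser is the safer choice.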
buckley