Possible Duplicate:
Using C# regular expressions to remove HTML tags

I'm trying to write code that returns only the text content of an HTML file. The two approaches I've come up with are either to remove everything inside <...> brackets, or to collect all the text between >...< brackets. I'm pretty new to regular expressions, but I'm pretty sure they're the way to go.

Here's the code I've tried:

        Regex reg = new Regex(@"<.*>");
        file = reg.Replace(file, ""); 

This works as long as there is only one <...> before a block of text. In any file that has two or more of those elements in sequence, like <...><...>, it starts deleting the text between them as well. Can someone tell me what I'm doing wrong?

Community

2 Answers

Regex quantifiers are greedy by default (they match the longest string they can). Depending on the language you are using, look for the +? or *? operators, which match the shortest possible string instead. Otherwise you'll have to build a different regex.
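To make the difference concrete, here is a minimal sketch in C# (the language of the question). The sample string and the class name are made up for illustration and are not from the question; only `Regex.Replace` from `System.Text.RegularExpressions` is assumed.

```csharp
using System;
using System.Text.RegularExpressions;

class GreedyVsLazy
{
    static void Main()
    {
        // Hypothetical sample input, not taken from the question.
        string html = "<p><b>Hello</b> world</p>";

        // Greedy: .* runs from the first '<' to the last '>' on the line,
        // so the text in between is swallowed too.
        string greedy = Regex.Replace(html, @"<.*>", "");

        // Lazy: .*? stops at the first '>' after each '<'.
        string lazy = Regex.Replace(html, @"<.*?>", "");

        Console.WriteLine("[" + greedy + "]"); // prints "[]"
        Console.WriteLine("[" + lazy + "]");   // prints "[Hello world]"
    }
}
```

The greedy version treats the whole line as a single match, which is the "deleting any text it finds" effect described in the question.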

Enoban
Well, the unexpected behavior you're getting is because your regular expression is greedy.

If you change your regex to

    Regex reg = new Regex(@"<.*?>");
    file = reg.Replace(file, ""); 

you'll get what you expect.

Also, be aware that regex doesn't handle nesting, which HTML has a lot of. I'd avoid using regex to parse HTML unless you're trying to match a very specific thing in a piece of HTML with a known, specific form.
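As a rough illustration of where tag-stripping with a regex starts to break down, here is a small sketch. The input strings and the class name are invented for this example. It also shows `<[^>]*>`, a negated character class that I'm suggesting as an alternative pattern (it is not from the question); it copes with tags that span several lines, but is still fooled by a `>` inside an attribute value.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexLimits
{
    static void Main()
    {
        // Hypothetical input: an opening tag split across two lines.
        string multiLine = "<a\n href=\"x\">link</a>";

        // By default '.' does not match '\n', so the lazy pattern
        // misses the opening tag and only removes "</a>".
        Console.WriteLine(Regex.Replace(multiLine, @"<.*?>", "") == "<a\n href=\"x\">link");  // prints "True"

        // [^>] matches any character except '>', newlines included.
        Console.WriteLine(Regex.Replace(multiLine, @"<[^>]*>", "") == "link");  // prints "True"

        // Hypothetical input: a '>' inside an attribute value.
        string tricky = "<img alt=\"a > b\">text";

        // Both patterns stop at the first '>', leaving part of the tag behind.
        Console.WriteLine(Regex.Replace(tricky, @"<[^>]*>", "") == " b\">text");  // prints "True"
    }
}
```

For anything beyond markup you control, a dedicated HTML parser is likely the safer choice.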