regular expression to eliminate text inside < and >

Question

Possible Duplicate:
Using C# regular expressions to remove HTML tags

I'm trying to write code that returns only the text content of an HTML file. The best approaches I've come up with are either to strip everything inside <...> brackets or to collect all the text between >...< brackets. I'm pretty new to regular expressions, but I'm fairly sure they're the way to go.

Here's the code I've tried:

        Regex reg = new Regex(@"<.*>");
        file = reg.Replace(file, "");

This works as long as there is only one <...> before a block of text. As soon as a file has two or more of those elements in sequence, like <...><...>, it starts deleting the text between them as well. Can someone tell me what I'm doing wrong?

Just try the test string in the comment. http://stackoverflow.com/a/12510496/932418 — L.B, Sep 25 '12 at 19:15
.*? will work like a charm, unless you want something else to be removed. — Pradip, Sep 25 '12 at 19:23

score 0 · Answer 1 · answered Sep 25 '12 at 19:18

Regex quantifiers are greedy by default (they match the longest string they can find). Depending on the language you're using, look for the +? or *? operators, which try for the shortest match instead. Otherwise you'll have to build a different regex.
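
To make the difference concrete, here is a minimal C# sketch comparing the greedy pattern from the question with its lazy counterpart. The class name, method names, and the sample string are made up for illustration; they are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Greedy: .* takes as much as it can, so a single match can run
    // from the first '<' to the last '>' on the line.
    public static string StripGreedy(string html) => Regex.Replace(html, @"<.*>", "");

    // Lazy: .*? takes as little as it can, so each match stops at the
    // first '>' after the opening '<'.
    public static string StripLazy(string html) => Regex.Replace(html, @"<.*?>", "");

    public static void Main()
    {
        string sample = "<p><b>Hello</b> world</p>"; // hypothetical input

        Console.WriteLine("[" + StripGreedy(sample) + "]"); // prints "[]"
        Console.WriteLine("[" + StripLazy(sample) + "]");   // prints "[Hello world]"
    }
}
```

The greedy version swallows the whole line because the line starts with < and ends with >, which matches the "deleting any text it finds" symptom described in the question.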

Enoban

Thanks, I'll read up more on greediness. – George Abraham Siegel Duffy Sep 25 '12 at 19:23

score 0 · Answer 2 · answered Sep 25 '12 at 19:18

Well, the unexpected behavior you're getting is because your regular expression is greedy: .* matches as much as it can, so <.*> runs from the first < to the last > on the line.

If you change your regex to

    Regex reg = new Regex(@"<.*?>");
    file = reg.Replace(file, "");

you'll get what you expect.

Also, be aware that regular expressions don't handle nesting, which HTML has a lot of. I'd avoid using regex to parse HTML unless you're trying to match one very specific thing in a piece of HTML whose form you already know.
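
As a rough illustration of where the lazy pattern still falls short, here is a small C# sketch with a made-up input in which an attribute value contains a > character. The class name, method name, and sample string are assumptions for the example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributePitfallDemo
{
    // Same lazy pattern as above: each match ends at the first '>' it
    // reaches, whether or not that '>' really closes the tag.
    public static string StripTags(string html) => Regex.Replace(html, @"<.*?>", "");

    public static void Main()
    {
        // Hypothetical input: the title attribute contains a '>'.
        string sample = "<a title=\"a > b\">link</a>";

        // The first match ends inside the attribute value, so the rest of
        // the attribute leaks into the output, which is: [ b">link]
        Console.WriteLine("[" + StripTags(sample) + "]");
    }
}
```

The regex has no notion of quoted attribute values, so it cannot tell that the first > is not the end of the tag. That is the kind of input where a real HTML parser is the safer choice.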

Sam I am says Reinstate Monica

Thanks. Should I use HTML Agility Pack instead? I saw that referenced in a comment. – George Abraham Siegel Duffy Sep 25 '12 at 19:23
@Sam what does your code give for `it happens` ? – L.B Sep 25 '12 at 19:25
@GeorgeAbrahamSiegelDuffy I've never really had to parse HTML myself, but I would definitely have a look at it if you need to parse HTML. – Sam I am says Reinstate Monica Sep 25 '12 at 19:26
@L.B It renders "shit happens". You've made that comment, or linked to it, at least 3 times in this single thread. Stop being a broken record, and also read the answer you're replying to. – Sam I am says Reinstate Monica Sep 25 '12 at 19:27
@Sam When you stop trying to parse HTML with regex, I will stop giving the same reference. – L.B Sep 25 '12 at 19:31
@L.B for crying out loud, get over yourself – Sam I am says Reinstate Monica Sep 25 '12 at 19:32
Actually, I was trying to figure out a way to ignore things like so a phrase wouldn't be broken up by a bolded word, so the shit happens problem kinda counts towards my advantage. – George Abraham Siegel Duffy Sep 25 '12 at 19:43
@GeorgeAbrahamSiegelDuffy LB's example is header-formatted text whose `title` attribute is `'e>Sh – Sam I am says Reinstate Monica Sep 25 '12 at 19:50
