
How can I match this kind of line:

<p><span class="font7" style="font-weight:bold;">text text text text </span></p>\r\n<p>

and at the same time avoid matching this kind of line:

<p><span class="font7" style="font-weight:bold;">text text text text </span><span class="font7"> text text text <br/> text text text </span></p>\r\n<p>

The problem is that the span tag (and so its closing </span>) appears twice in the second line, and I want to avoid that. I only want a match when it appears once in a line.

I have tried this regex:

<p><span class="font7" style="font-weight:bold;">.+?(?:(?!.+?</span>.+?$)){2}</p>\r\n<p>

Please help me, if possible in the .NET, Perl, or Ruby flavor.

Greetings

Andy Lester
alex
  • Do not use regex to parse HTML. Please see the first answer to http://stackoverflow.com/questions/1732348 – Jim Garrison Dec 15 '12 at 01:21
  • The problem with that answer is that it is funny to those of us who understand the problems of HTML parsing, but meaningless to the novices who don't. – Andy Lester Dec 15 '12 at 03:39

1 Answer


Do not try to parse HTML with regular expressions. You can't do it reliably. Regular expressions are not up to the task.

You need a proper HTML parser. It will be an HTML parser that has been well-tested and used by many people, as opposed to whatever regexes you try to cobble together.

Here are some options for Perl HTML parsers. Start there.
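
As a rough illustration of the parser approach, here is a minimal sketch that counts the span tags inside each paragraph instead of pattern-matching the markup. It uses Python's standard-library `html.parser` purely for demonstration; the same idea should carry over to a Perl or Ruby HTML parser. The names `SpanCounter` and `has_single_span` are made up for this example, and it assumes each line holds complete `<p>...</p>` elements (a trailing unclosed `<p>`, as in the samples from the question, is ignored).

```python
from html.parser import HTMLParser


class SpanCounter(HTMLParser):
    """Record how many <span> start tags occur inside each closed <p>."""

    def __init__(self):
        super().__init__()
        self.counts = []       # one entry per closed <p> element
        self._current = None   # span count of the open <p>, None outside one

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current = 0
        elif tag == "span" and self._current is not None:
            self._current += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._current is not None:
            self.counts.append(self._current)
            self._current = None


def has_single_span(line):
    """True if the line has at least one closed <p> and each has exactly one <span>."""
    parser = SpanCounter()
    parser.feed(line)
    parser.close()
    return bool(parser.counts) and all(n == 1 for n in parser.counts)
```

Calling `has_single_span` on the first sample line from the question should return `True`, and on the second sample line (two spans in one paragraph) `False`. This sketch does not check the `class` or `style` attributes; if that matters, the `attrs` argument of `handle_starttag` can be inspected as well.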

Andy Lester
  • Thanks, but I already solved the problem by myself; it was not really difficult. For some harder tasks, though, I would consider the option you gave. – alex Dec 15 '12 at 08:10