0

How would I write a RegEx to:

Find a match where the first instance of a > character is before the first instance of a < character.

(I am looking for bad HTML where the first closing > on a line has no opening < before it.)

Gumbo
Ian Vink
  • This does assume that your HTML is formatted in a way that allows all HTML to be on one line. So if someone starts an HTML tag and closes it on the next line you'd get a false positive although the HTML would be valid. – spig Aug 17 '10 at 14:53

3 Answers

2

It's a pretty bad idea to try to parse HTML with a regex, or even to try to detect broken HTML with one.

What happens, for example, when a line break puts the > character first on a line (which is valid HTML)?
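To make the line-break case concrete, here is a minimal Python sketch. Python, the naive per-line pattern `^[^<]*>`, and the sample markup are all my assumptions for illustration; none of them come from the question.

```python
import re

# Naive per-line check (assumed pattern): a '>' appears before any '<'.
NAIVE = re.compile(r"^[^<]*>")

# Valid HTML (made-up sample): the tag simply closes on the next line.
html = '<a href="x"\n>link</a>'

for number, line in enumerate(html.splitlines(), start=1):
    flagged = NAIVE.search(line) is not None
    print(number, repr(line), "flagged" if flagged else "ok")
```

The second line is flagged even though the markup is well-formed, which is the false positive described above.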

You might also get some mileage from reading the answers to this question: RegEx match open tags except XHTML self-contained tags

Alexander Kjäll
1

Would this work?

string =~ /^[^<]*>/

This anchors at the beginning of the line, skips any characters that aren't an opening '<', and matches if it then finds a closing '>'.
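As a rough illustration of how that pattern could be applied line by line, here is a small Python sketch. The helper name `lines_with_stray_close` and the sample text are hypothetical, and the same pattern should carry over to Perl or Ruby unchanged.

```python
import re

# Same pattern as above: from the start of the line, no '<' before a '>'.
STRAY_CLOSE = re.compile(r"^[^<]*>")

def lines_with_stray_close(text):
    """Return 1-based numbers of lines where '>' precedes any '<'."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if STRAY_CLOSE.search(line)
    ]

# Made-up sample: only the second line has a '>' with no '<' before it.
sample = "<p>fine</p>\nbroken> <b>text</b>\nplain text"
print(lines_with_stray_close(sample))  # prints [2]
```

Checking each line separately keeps the report tied to line numbers, at the cost of knowing nothing about tags that span lines.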

spig
  • What happens if the > was a closing one from the line above? – Alexander Kjäll Aug 17 '10 at 14:54
  • I think that's a problem with the question. This will do what he asked it to do. Taking the previous lines into account opens up the can of worms of using a regular expression to check a non-regular language. – spig Aug 17 '10 at 15:06
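  • In Perl, Ruby, and other languages, modifiers change how line breaks are treated, though the letters differ: in Perl, "s" lets . match newlines and "m" makes ^ match at every line start, while in Ruby "m" lets . match newlines and ^ always matches at line starts. I re-read his question and he doesn't necessarily specify that it would all be on one line. Since [^<]* already crosses line breaks, anchoring with \A checks the whole string: `string =~ /\A[^<]*>/` – spig Aug 17 '10 at 15:32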
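Checking the whole input rather than each line could look like the following Python sketch. The function name `starts_with_stray_close` and the sample strings are hypothetical. In Python, `\A` anchors at the very start of the string, and a negated class such as `[^<]` matches newlines without any flag.

```python
import re

# \A anchors at the start of the whole string; [^<] also matches '\n',
# so the scan runs across line breaks until the first '<' or '>'.
WHOLE_INPUT = re.compile(r"\A[^<]*>")

def starts_with_stray_close(text):
    """True if a '>' occurs before the first '<' anywhere in text."""
    return WHOLE_INPUT.search(text) is not None

print(starts_with_stray_close('<a href="x"\n>link</a>'))  # False
print(starts_with_stray_close('text\nmore> <b>x</b>'))    # True
```

This avoids flagging a tag that merely closes on the next line, but it only reports the first problem in the input and says nothing about later lines.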
0
^[^<>]*>

If you need the corresponding < as well,

^[^<>]*>[^<]*<

If there is a possibility of tags before the first >,

^[^<>]*(?:<[^<>]+>[^<>]*)*>

Note that it can give false positives, e.g.

<!-- > -->

is valid HTML, but the RegEx will flag it.
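To compare the three patterns side by side, here is a short Python sketch. The dictionary labels and the sample lines are made up for illustration; only the patterns themselves come from this answer.

```python
import re

# The three patterns from this answer, under hypothetical labels.
PATTERNS = {
    "stray >": re.compile(r"^[^<>]*>"),
    "stray > then <": re.compile(r"^[^<>]*>[^<]*<"),
    "stray > after whole tags": re.compile(r"^[^<>]*(?:<[^<>]+>[^<>]*)*>"),
}

def hits(line):
    """Return the labels of every pattern that matches the line."""
    return [name for name, rx in PATTERNS.items() if rx.search(line)]

# Made-up sample lines; the last one is the valid comment shown above.
SAMPLES = [
    "text> more",
    "text> <b>bold</b>",
    "<b>bold</b> text> more",
    "<b>bold</b>",
    "<!-- > -->",
]

for line in SAMPLES:
    print(repr(line), hits(line))
```

Only the third pattern catches a stray > that follows complete tags, and it is also the only one that trips on the comment.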

kennytm