I need a regex that matches a ">" character in an HTML string but doesn't match a tag's closing bracket. Example:

<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>

The regex should match the ">" between a and b.

Tanparmaiel
zavolokas

4 Answers


You can use a negative lookbehind: (?<!\<[^>]+)\>.
Untested

This will match any > character that isn't preceded by the beginning of an HTML tag (a sequence starting with < and not containing >).
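
For readers who are not on .NET, here is a rough Python sketch of the same idea. Python's built-in re module only accepts fixed-width lookbehind, so this swaps in a different technique: match whole tags first and capture only a ">" that no tag consumed. The names TAG_OR_STRAY_GT and stray_gt_positions are made up for illustration, and the pattern shares the simple "< followed by non-> characters" notion of a tag used above, so it is not a full HTML tokenizer.

```python
import re

# Match a whole tag first; only a ">" that was not consumed as part of a
# tag reaches the second alternative and is captured in group 1.
TAG_OR_STRAY_GT = re.compile(r"<[^>]+>|(>)")


def stray_gt_positions(html) -> list:
    """Return the indexes of ">" characters that do not close a tag."""
    return [m.start(1) for m in TAG_OR_STRAY_GT.finditer(html) if m.group(1)]


sample = '<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>'
positions = stray_gt_positions(sample)
print(positions)  # indexes of stray ">" characters in the sample
```

On the sample string from the question this should report only the ">" between a and b.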

SLaks
  • You were just a little faster than my answer; I shouldn't be so wordy. – KeithS Feb 22 '11 at 15:31
  • @zav: .NET supports negative lookbehind; JavaScript doesn't. In .NET, this does work. Paste the following into LINQPad: `Regex.Matches(@" bla bla a > b bla bla bla >bla ", @"(?<!\<[^>]+)\>")` – SLaks Feb 22 '11 at 15:39
  • I spent an hour constructing this regex and tested it with this tool. I didn't know about this issue in JavaScript. Thanks a lot! It does work. – zavolokas Feb 22 '11 at 15:53

The following regex should work:

([^/]>)+
ennuikiller

What you need is a regex that finds "unpaired" greater-than signs: > characters that are not preceded by a <, as they would be in a tag.

Try this: "(?<!\<[^<>]+)\>". It should match a greater-than sign that is not part of an HTML tag, where a tag is a construct consisting of a less-than sign, one or more characters other than angle brackets, then a greater-than sign.

EDIT: put in SLaks's suggestions. I'll keep the < in the "not match" block just in case the less-than being matched is also not part of a tag, for instance << or <-. It shouldn't hurt the pattern's ability to match proper tags.

KeithS
  • You don't need to exclude `<` in the tags, and you don't need to escape the contents of an `[]` block (except for `-`) – SLaks Feb 22 '11 at 15:32

A specific solution rather than just an admonition:

"Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away." - http://www.crummy.com/software/BeautifulSoup/

Don't use regex to parse HTML:

"Among programmers of any experience, it is generally regarded as A Bad Idea to attempt to parse HTML with regular expressions." - Link

and "You can't parse [X]HTML with regex" - 4352 votes at the time of this posting

"Parsing HTML is a solved problem. You do not need to solve it. You just need to be lazy. Be lazy, use ..." something designed for that purpose.
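
As a rough illustration of "something designed for that purpose", here is a small Python sketch. It uses the standard library's html.parser rather than Beautiful Soup, only so that it runs without installing anything; the TextCollector class and visible_text helper are hypothetical names for this example. It assumes the goal is to get at the text between tags, where a bare ">" is treated as ordinary text rather than as a tag delimiter.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called with the text found between tags.
        self.chunks.append(data)


def visible_text(html):
    """Return the concatenated text content of an HTML fragment."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


sample = '<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>'
text = visible_text(sample)
print(text)
```

Once the text is isolated this way, finding or escaping the stray ">" no longer requires telling it apart from tag brackets.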

Community
Maslow
  • @Slaks - how do you go about reading a string containing html without it being called parsing? – Maslow Feb 22 '11 at 15:34
  • I'm not parsing HTML. I'm correcting it to be parsed with HtmlAgilityPack – zavolokas Feb 22 '11 at 15:34
  • Parsing means creating a structure (eg, a DOM tree). He's just searching for characters. – SLaks Feb 22 '11 at 15:35
  • @zav: Are you sure HAP can't handle this already? If it can't, it should. – SLaks Feb 22 '11 at 15:35
  • That's not what this article implies to me - http://en.wikipedia.org/wiki/Parsing – Maslow Feb 22 '11 at 15:36
  • @SLaks - How do you get to the plain text inside a tokenized string without parsing? Just because it's not a proper or strict exception-throwing operation doesn't mean it's not parsing. – Maslow Feb 22 '11 at 15:38
  • @SLaks,Zavolokas concerning this not being parsing... the first sentence of the html agility pack's purpose is "This is an agile HTML parser that builds a read/write DOM" - http://htmlagilitypack.codeplex.com/ – Maslow Feb 22 '11 at 15:39
  • At least the version I have throws an exception – zavolokas Feb 22 '11 at 15:40