Regex to match ">" in HTML

Question

I need a regex that matches the ">" character in an HTML string but doesn't match a tag's closing bracket. Example:

The regex should match the ">" between a and b.

What @SLaks said. Plus: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not — Cfreak, Feb 22 '11 at 15:26
No, this is a kind of badly formed HTML, and instead of &gt; there is ">" — zavolokas, Feb 22 '11 at 15:27
@Matt, @CFreak: Regex will work fine for this. All you need to know is whether you're inside a start/end tag. — SLaks, Feb 22 '11 at 15:28
@Matt: No; why would I mean that? _All_ content is between tags. — SLaks, Feb 22 '11 at 15:30

SLaks · Accepted Answer · 2011-02-22T15:38:41.010

1

You can use a negative lookbehind: (?<!\<[^>]+)\>.
Untested

This will match any > character that isn't preceded by the beginning of an HTML tag (a sequence starting with < and not containing >).
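
For readers whose regex engine lacks variable-length lookbehind (Python's built-in `re` module, for instance, only accepts fixed-width lookbehind), the same rule can be approximated with a plain scan. This is a minimal sketch, not the answer author's code. The function name `unpaired_gt_positions` is hypothetical, and the sketch assumes the rule is exactly "skip a `>` when a `<` followed by at least one non-`>` character precedes it".

```python
def unpaired_gt_positions(text):
    """Return indexes of '>' not preceded by '<' plus one or more non-'>' characters.

    Hand-written approximation of the lookbehind (?<!<[^>]+)> (hypothetical helper).
    """
    positions = []
    last_gt = -1  # index of the previous '>' seen, or -1 if none yet
    for i, ch in enumerate(text):
        if ch != ">":
            continue
        # A '<' only counts if it comes after the previous '>' and at least
        # two characters before this '>', so one or more characters separate them.
        start, end = last_gt + 1, i - 1
        window = text[start:end] if end > start else ""
        if "<" not in window:
            positions.append(i)
        last_gt = i
    return positions


if __name__ == "__main__":
    # Assumed sample input; only the '>' between a and b should be reported.
    print(unpaired_gt_positions("<p>a > b</p>"))
```

On the assumed sample `<p>a > b</p>`, only index 5 (the `>` between a and b) is reported, because the closing brackets of `<p>` and `</p>` each have a `<` earlier in the same unclosed run.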

edited Feb 22 '11 at 15:38

answered Feb 22 '11 at 15:29

SLaks


You were just a little faster than my answer; I shouldn't be so wordy. – KeithS Feb 22 '11 at 15:31
@zav: .Net supports negative lookbehind; Javascript doesn't. In .Net, this does work. Paste the following into LINQPad: `Regex.Matches(@" bla bla a > b bla bla bla >bla ", @"(?<!\<[^>]+)\>")` – SLaks Feb 22 '11 at 15:39
I spent an hour constructing this regex and tested it with this tool. I didn't know about this issue in JavaScript. Thanks a lot! It does work. – zavolokas Feb 22 '11 at 15:53

score 0 · Answer 2 · answered Feb 22 '11 at 15:27

0

The following regex should work:

([^/]>)+
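
To see what this pattern actually selects, it can be run unchanged in Python, whose `re` module accepts it as written. This is only an illustrative check. The sample strings and the helper name `matched_spans` are assumptions, not part of the answer.

```python
import re

# The pattern from the answer, unchanged.
PATTERN = re.compile(r"([^/]>)+")


def matched_spans(text):
    """Return every substring the pattern matches in text (hypothetical helper)."""
    return [m.group(0) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    # Assumed sample inputs.
    print(matched_spans("<p>a > b</p>"))
    print(matched_spans("<br/>"))
```

On the sample `<p>a > b</p>`, the pattern matches the `>` between a and b but also the `>` that closes `<p>` and `</p>`, since it only excludes a `>` that directly follows `/`.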

answered Feb 22 '11 at 15:27

ennuikiller


KeithS · Answer 3 · 2011-02-22T15:46:28.800

0

What you need is a regex that finds "unpaired" greater-than signs; >s that are not preceded by a < as you'd find in a tag.

Try this: "(?<!\<[^<>]+)\>" It should match a greater-than that is not part of an HTML tag; that is, a construct consisting of a less-than, some number of characters other than the angle-bracket characters, then a greater than.

EDIT: put in SLaks's suggestions. I'll keep the < in the "not match" block just in case the less-than being matched is also not part of a tag, for instance << or <-. It shouldn't hurt the pattern's ability to match proper tags.

edited Feb 22 '11 at 15:46

answered Feb 22 '11 at 15:29

KeithS


You don't need to exclude `<` in the tags, and you don't need to escape the contents of an `[]` block (except for `-`) – SLaks Feb 22 '11 at 15:32

score 0 · Answer 4 · edited May 23 '17 at 12:04

0

A specific solution rather than just an admonition:

"Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away. " - http://www.crummy.com/software/BeautifulSoup/

Don't use regex to parse HTML:

"Among programmers of any experience, it is generally regarded as A Bad Idea to attempt to parse HTML with regular expressions." - Link

and "You can't parse [X]HTML with regex" - 4352 votes at the time of this posting

"Parsing HTML is a solved problem. You do not need to solve it. You just need to be lazy. Be lazy, use ..." something designed for that purpose.

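As a rough illustration of the "use something designed for that purpose" advice, the sketch below separates text from tags with a tolerant parser, so any ">" left in the text is by construction not a tag bracket. Beautiful Soup is a third-party package, so this sketch swaps in Python's standard-library `html.parser` instead. The names `TextCollector` and `text_outside_tags` and the sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text found between tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for runs of text outside tags; a raw '>' stays in the text.
        self.chunks.append(data)


def text_outside_tags(markup):
    """Return all text content of markup, with the tags removed."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


if __name__ == "__main__":
    # Assumed sample: a raw '>' inside the text of a paragraph.
    text = text_outside_tags("<p>a > b</p>")
    print(text, text.count(">"))
```

Whether a given parser tolerates a particular piece of malformed markup still has to be checked against the real input. A comment in this thread reports an exception from the HtmlAgilityPack version in use.
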
edited May 23 '17 at 12:04

Community


answered Feb 22 '11 at 15:30

Maslow


@Slaks - how do you go about reading a string containing html without it being called parsing? – Maslow Feb 22 '11 at 15:34
I'm not parsing HTML. I'm correcting it to be parsed with HtmlAgilityPack – zavolokas Feb 22 '11 at 15:34
Parsing means creating a structure (eg, a DOM tree). He's just searching for characters. – SLaks Feb 22 '11 at 15:35
@zav: Are you sure HAP can't handle this already? If it can't, it should. – SLaks Feb 22 '11 at 15:35
That's not what this article implies to me - http://en.wikipedia.org/wiki/Parsing – Maslow Feb 22 '11 at 15:36
@SLaks - How do you get to the plain text inside a tokenized string without parsing? just because it's not a proper or strict exception throwing operation doesn't mean it's not parsing. – Maslow Feb 22 '11 at 15:38
@SLaks,Zavolokas concerning this not being parsing... the first sentence of the html agility pack's purpose is "This is an agile HTML parser that builds a read/write DOM" - http://htmlagilitypack.codeplex.com/ – Maslow Feb 22 '11 at 15:39
At least the version I have throws an exception – zavolokas Feb 22 '11 at 15:40
