I ran into this problem in a fairly trivial task. HTML text should not contain the characters '<', '>' and '&'. The third one is a riddle for me. I want to use a regular expression to find all '&' characters, but this character can also appear in entity names, e.g. &amp;, which are allowed. So my requirement for the regex is to find all '&' that aren't part of the format &[a-z]; I am not a regex master, so the best solution I figured out is this regex:

Regex _allAmps = new Regex("((&[a-z]*;))|[&]", RegexOptions.Compiled | RegexOptions.IgnoreCase));
...
List<Match> invalidChars.AddRange(_allAmps.Matches(htmlText).Cast<Match>.Where()m => m.Value.Lenght == 1);

But this is an improvisation. The regex matches all single '&' characters and all entity names, and only the single characters are kept. Is there a way to compose a regular expression that does this directly? I tried a negative lookahead, but that way the regex matched all '&' characters.

Liam
Qerts
  • 2
    Why don't you decode the HTML you get and just match regular plain text? What parser are you using (if any)? Also, have a look at [this answer of mine: *Complete HTML Strip function*](http://stackoverflow.com/questions/30028021/complete-html-strip-function/30028142#30028142). – Wiktor Stribiżew Nov 18 '15 at 15:11
  • Just [HtmlEncode](https://msdn.microsoft.com/en-us/library/system.web.httpserverutility.htmlencode(v=vs.110).aspx) it. – Liam Nov 18 '15 at 15:11
  • Well, this question is not about HTML itself, but primarily about regex. I thought everything could be accomplished through regular expressions so I am curious how to do something like this. – Qerts Nov 18 '15 at 15:18
  • If you are interested in anything specific, please post an [MCVE (minimal complete verifiable example)](http://stackoverflow.com/help/mcve). Also, your code contains typos: `m.Value.Lenght`, and `Cast` must be `Cast()` I guess, and there are more issues. You know, we can also post an "improvisation" answer if you like :) – Wiktor Stribiżew Nov 18 '15 at 15:30
  • Don't you think the answer is just [`&(?!\w*;)`](http://regexstorm.net/tester?p=%26(%3f!%5cw*%3b)&i=%26gt%3b%3d35+%26amp%3b+and+%26+%26lt%3b50)? – Wiktor Stribiżew Nov 18 '15 at 15:39
  • Yes, it is! Thank you. I was just using that lookahead the wrong way all the time. Post it as an answer please, I will accept it. – Qerts Nov 18 '15 at 15:51
  • Go with [sln's answer](http://stackoverflow.com/a/33785432/20938). It includes coverage for numerical entities like `&#38;` (another commonly used encoding of the ampersand). – Alan Moore Nov 18 '15 at 19:12
  • @Liam HtmlEncoding it would double-encode any proper Html, which would result in bad text. – ErikE Nov 19 '15 at 00:46
  • HTML decode the string and HTML encode the result. Save that. User's HTML mistakes be erased! Too bad, so sad. – ErikE Nov 19 '15 at 00:49

2 Answers

You could use a lookahead assertion.

(?i)[&](?!(?:[a-z]+\d*|(?:\#(?:[0-9]+|x[0-9a-f]+)));)

Formatted

 (?i)                          # Case insensitive
 [&]                           # Ampersand (can make it [%&] to be thorough)
 (?!                           # Only if not an entity
      (?:
           [a-z]+\d*           # Named entity, optionally ending in digits
        |  (?:
                \#
                (?:
                     [0-9]+           # Decimal character reference
                  |  x [0-9a-f]+      # Hexadecimal character reference
                )
           )
      )
      ;     
 )
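
For what it's worth, here is a minimal C# sketch of how this pattern might be dropped into the code from the question. It assumes the digit part of a named entity is optional (`\d*`), so that plain names like `&amp;` count as entities. The class and method names (`AmpersandCheck`, `FindBareAmpersands`) and the sample string are made up for illustration; they are not from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class AmpersandCheck
{
    // '&' that is NOT followed by an entity body and ';':
    //   [a-z]+\d*       named entity, optionally ending in digits (assumption)
    //   \#[0-9]+        decimal character reference
    //   \#x[0-9a-f]+    hexadecimal character reference
    static readonly Regex BareAmp = new Regex(
        @"&(?!(?:[a-z]+\d*|\#(?:[0-9]+|x[0-9a-f]+));)",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Returns the zero-based index of every stray '&' in the text.
    public static int[] FindBareAmpersands(string htmlText) =>
        BareAmp.Matches(htmlText).Cast<Match>().Select(m => m.Index).ToArray();

    static void Main()
    {
        // Hypothetical sample: entities mixed with two stray ampersands.
        string sample = "&gt;=35 &amp; and & &lt;50 &#38; &#x26; &frac12; AT&T";
        Console.WriteLine(string.Join(", ", FindBareAmpersands(sample)));
    }
}
```

Because the lookahead consumes nothing, every match is exactly one character long, so the `Where(m => m.Value.Length == 1)` filter from the question is no longer needed.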
  • Good, but there are some named entities that contain digits as well as letters. They're always at the end, so `[a-z]+\d*` will cover them. – Alan Moore Nov 18 '15 at 19:05
  • @AlanMoore - Ok, it's updated... Yeah, DTDs. I guess you'd have to ad hoc read the predefined entity list and base a regex on that. From Wikipedia: 'The HTML 4 specification requires the use of the standard DTDs and does not allow users to define additional entities.' While the XML specification allows user-defined <!ENTITY ...> using parameter `%name;` character `..;` and named `&name;` references, where _name_ in ASCII is roughly equal to `[A-Za-z_:][\w:.-]*`. In Unicode, _name_ covers many more ranges of characters. Note that a colon is a valid first character. –  Nov 19 '15 at 00:24
  • As I went through the entity list in an HTML 4 DTD, I did not notice any _name_ characters other than alphanumerics, so that's why I didn't use a `\w`. In reality, a valid SGML name can include (in ASCII) `[A-Za-z_:][\w:.-]*`. So, if entities are a real concern across the family, I guess you could use that. –  Nov 19 '15 at 00:30
  • 1
    Correct me if I'm reading this wrong... is the non-capturing group after the first `|` actually needed? the alternation inside it occurs within its own non-capturing group. – ErikE Nov 19 '15 at 00:43
  • @ErikE - You're right, it's not needed. It's there for emphasis. –  Nov 19 '15 at 19:05
  • I'm not fond of code artifacts that communicate "this does something important" but don't actually do anything. I think it wastes developer time trying to understand the code. In my mind, non-capturing groups are only appropriate when the parentheses are actually required, in order to perform alternation or repetition. Regexes are complicated enough to understand that adding stuff to them is not useful (even for emphasis). Of course, it's your answer and you get to put what you want in there! – ErikE Nov 19 '15 at 19:07
  • @ErikE - Well you know, sometimes encapsulation is a good thing. Something could be added later. I'm pretty sure the engine will work it out. –  Nov 19 '15 at 19:15
  • Just a philosophical difference here I guess. In my book, **The code should communicate what it does, not what it could do**. – ErikE Nov 19 '15 at 19:17
  • @ErikE - For no reason at all, sometimes I'll just scope a block of C++ code. –  Nov 19 '15 at 19:18

Why don't you use regex boundaries? Have a look at this: http://www.rexegg.com/regex-boundaries.html

Akash