I need to match an & that appears in plain text, but the pattern should not capture the & that starts an entity such as &#x69;

e.g.,

hi this is a plain text containing & and the entity &#x45; , &#38; and &amp;

In the above text I should find only the & that is in the plain text, i.e., the one coming after "containing". I tried the pattern &[^#x]* but I couldn't get all the matches.

Alan Moore
Hulk

2 Answers

A regex for matching HTML entities, borrowed from another answer, combined with a negative look-ahead:

&(?!(amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|
     \#[1-9]\d{1,3}|[A-Za-z][0-9A-Za-z]+);)

Shortened:

&(?!(\#[1-9]\d{1,3}|[A-Za-z][0-9A-Za-z]+);)

Explained:

We want to match & but not &#123; etc.

&                 // match an ampersand
(                 // group starts
    ?!            // negative look-ahead (don't match '&' if this group matches)
    (\#[1-9]\d{1,3}|[A-Za-z][0-9A-Za-z]+); // regex to match HTML entity after '&'
)                 // group ends
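
A quick way to check the shortened pattern is to run it over some sample text. The sketch below uses Python's `re` module. The sample string, the helper `bare_positions`, and the `BARE_AMP_HEX` variant (which also skips hexadecimal references such as &#x45;) are assumptions for illustration and are not part of the answer above.

```python
import re

# Shortened pattern from the answer: '&' not followed by an entity body and ';'.
BARE_AMP = re.compile(r"&(?!(\#[1-9]\d{1,3}|[A-Za-z][0-9A-Za-z]+);)")

# Assumed extension (not in the answer): also skip hex references like &#x45;.
BARE_AMP_HEX = re.compile(
    r"&(?!(\#[1-9]\d{1,3}|\#[xX][0-9A-Fa-f]+|[A-Za-z][0-9A-Za-z]+);)"
)

def bare_positions(pattern, text):
    """Return the index of every '&' that the pattern treats as plain text."""
    return [m.start() for m in pattern.finditer(text)]

# Hypothetical sample input.
sample = "plain text containing & and the entities &#x45;, &amp; and &#38;"

print(bare_positions(BARE_AMP, sample))      # the bare '&', plus the '&' of &#x45;
print(bare_positions(BARE_AMP_HEX, sample))  # only the bare '&'
```

The shortened pattern has no branch for `#x...`, so a hexadecimal reference is reported as a bare ampersand unless a branch like the assumed one is added.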
Community
mmdemirbas
With [^#x] you match any single character that is neither '#' nor 'x'. What you probably want is &[^#][^x]. If '&' may appear at the end of the string, or the string may be shorter than 3 characters, you have to handle these cases separately.
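
The difference between the two patterns, and the end-of-string caveat, can be seen in a small Python sketch. The sample strings and variable names here are made up for illustration.

```python
import re

text = "a & b &#x45; c"  # hypothetical sample input

# [^#x]* is "zero or more characters that are neither '#' nor 'x'",
# so the match runs from the first '&' up to the next '#' or 'x'.
greedy = re.findall(r"&[^#x]*", text)

# &[^#][^x] needs exactly two more characters: one that is not '#',
# then one that is not 'x'. Both become part of the match.
two_chars = re.findall(r"&[^#][^x]", text)

# A trailing '&' has no characters after it, so this pattern misses it.
trailing = re.findall(r"&[^#][^x]", "a &")

print(greedy, two_chars, trailing)
```

Note that &[^#][^x] consumes the two characters after the '&', so the match is longer than the ampersand itself.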

PS: Escaping depends on your actual flavour of regex.

EDIT

For the case of &amp (and all other HTML entities, e.g. &excl; = !) you can simply provide alternatives, e.g. &([^#][^x])|([^a][^m][^p])|([^e][^x][^c][^l])

If your flavour of regex allows look-ahead assertions, it is easier to use &(?!(#x|amp|excl)) etc.
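
As a rough check of the look-ahead version, here is a minimal Python sketch. The sample text and the name `NOT_ENTITY` are assumptions for illustration, and only the three prefixes listed above are excluded.

```python
import re

# Look-ahead pattern from above: '&' not followed by '#x', 'amp' or 'excl'.
NOT_ENTITY = re.compile(r"&(?!(#x|amp|excl))")

text = "fish & chips, &#x45;, &amp; and &excl;"  # hypothetical sample input

# The look-ahead consumes nothing, so each match is just the '&' itself.
matches = [m.start() for m in NOT_ENTITY.finditer(text)]
print(matches)

# A '&' at the very end of the string is still found.
print(NOT_ENTITY.search("a &") is not None)
```

Because the look-ahead does not require a closing ';', a plain '&' that happens to be followed by one of the listed prefixes (as in "&amplify") is skipped as well. Adding ';' to the look-ahead would tighten that, at the cost of having to spell out the full entity bodies.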

Matthias
  • Thanks for the reply, but what about the third case, i.e. &amp;? I don't want to catch that one either. – Hulk Aug 23 '12 at 09:39