
I'm talking about stuff like &amp;amp;, which will then render as &amp; when it should actually render as &. In another question I asked how to match entities, but it seems that isn't really possible or realistic with regexes. What, then, is the best way to match double entities?

EDIT: Is this a good way to do it? .replace(/&amp;(?=#?x?[0-9a-z]+;)/i, '&');

(I'm using JavaScript.)

wwaawaw

3 Answers


I'd go with

 pattern       &([a-zA-Z0-9]+?;)\1
 replacement   &$1

to replace just double amps, or, with the same replacement:

 pattern       &amp;([#a-zA-Z0-9]+?;)

EDIT:

Your pattern

 /&amp;(?=#?x?[0-9a-z]+;)/i

also looks good to me.

Note: none of these patterns is something you can fully trust.
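For what it's worth, here is a rough sketch of how two of these could be applied in JavaScript. The function names collapseDoubledName and unescapeOnce are made up for illustration, the g flag is my addition so that every occurrence is handled, and the lookahead version assumes the pattern is meant to target a literal &amp; followed by an entity body.

```javascript
// Hypothetical helpers; the names are illustrative only.

// A name repeated twice, e.g. "&amp;amp;" -> "&amp;".
function collapseDoubledName(text) {
  return text.replace(/&([a-zA-Z0-9]+?;)\1/g, '&$1');
}

// Lookahead variant: turn "&amp;" into "&" only when something that
// looks like an entity body follows, e.g. "&amp;lt;" -> "&lt;".
function unescapeOnce(text) {
  return text.replace(/&amp;(?=#?x?[0-9a-z]+;)/gi, '&');
}

console.log(collapseDoubledName('a &amp;amp; b')); // a &amp; b
console.log(unescapeOnce('&amp;lt;b&amp;gt;'));    // &lt;b&gt;
console.log(unescapeOnce('Tom &amp; Jerry'));      // Tom &amp; Jerry (unchanged)
console.log(unescapeOnce('R&amp;D;'));             // R&D; (a false positive)
```

The last line shows why this can't be trusted blindly: ordinary text that merely looks like an entity body after &amp; gets rewritten as well.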

guido

Possibly:

&amp;[a-zA-Z]+;

Though it's not foolproof.

Oded

Normalize your data first. Use whatever you know about the encoding to decode the data back to a form in which each character or piece of data has only one possible encoding. After that, match the normalized data against a normalized pattern.
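A minimal sketch of that idea in plain JavaScript, assuming no DOM is available. The NAMED table is a deliberately tiny, hypothetical subset (a real decoder would need the full HTML entity list), and decodeOnce and normalize are names I made up. Decoding until nothing changes also assumes the data was never meant to contain literal entity text.

```javascript
// Hypothetical minimal entity table; not the full HTML list.
const NAMED = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

// Decode one level of entities; unknown names are left untouched.
function decodeOnce(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave references outside the Unicode range as they are.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

// Repeat until the text stops changing (bounded, just in case).
function normalize(text, maxPasses = 5) {
  for (let i = 0; i < maxPasses; i++) {
    const next = decodeOnce(text);
    if (next === text) break;
    text = next;
  }
  return text;
}

console.log(normalize('Tom &amp;amp; Jerry')); // Tom & Jerry
console.log(normalize('&amp;amp;lt;'));        // <
```

Once both sides are in this form, a comparison like normalize(a) === normalize(b) no longer cares how many times either one was escaped.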

Oleg V. Volkov