I have to replace & with its named entity or decimal entity in an input string, but the input string may already contain other named and decimal entities, in which & also appears.

Code:

import re
text =' At&T, " < I am > , At&T  so  &#60; &lt; &  & '

#- Get all name entities and decimal entities.
replace_tmp = re.findall(r"&#\d+;|&[a-z]+;", text)

#- Replace above values from tempvalues.
tmp_dict = {}
count = 1
for i in replace_tmp:
    text = text.replace(i, "$%d$"%count)
    tmp_dict["$%d$"%count] = i
    count += 1


#- Replace & with &amp;
text = text.replace("&", "&amp;")

#- Replace tempvalues values with original.
for i in tmp_dict:
    text = text.replace(i, tmp_dict[i])

print(text)

Final Output: At&amp;T, " < I am > , At&amp;T so &#60; &lt; &amp; &amp;

But can I get a regular expression that does the above directly?


Final line in py file:

value = re.sub(r'&(?!(#[0-9]+;|[a-zA-Z]+;))', '&amp;', value).replace("<", "&lt;").replace(">", "&gt;").replace('"', "&quot;")
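For anyone reusing that final line, here is a minimal sketch of it wrapped in a function. The helper name escape_keep_entities is hypothetical (it is not from my original file); the pattern and the replace chain are the same as above. The bare & is handled first, then <, > and ".

```python
import re

# Same pattern as the final line above: an "&" that does not start
# a decimal entity (&#60;) or a named entity (&lt;).
_BARE_AMP = re.compile(r'&(?!(#[0-9]+;|[a-zA-Z]+;))')


def escape_keep_entities(value):
    """Escape &, <, > and " while leaving existing entities untouched.

    Hypothetical helper name, used only for illustration.
    """
    value = _BARE_AMP.sub('&amp;', value)
    return (value.replace('<', '&lt;')
                 .replace('>', '&gt;')
                 .replace('"', '&quot;'))


print(escape_keep_entities('At&T, " < I am > &#60; &lt; &'))
# prints: At&amp;T, &quot; &lt; I am &gt; &#60; &lt; &amp;
```

Because existing entities are skipped, running the function on its own output should leave the string unchanged.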

Vivek Sable

2 Answers


Use re.sub with a negative lookahead.

import re
text =' At&T, " < I am > , At&T  so  &#60; &lt; &  & '

text = re.sub(r'&(?![\w\d#]+?;)',"&amp;",text)
print(text)
mkHun
>>> import re
>>> re.sub(r'&(?!(#[0-9]+;|\w+;))', '&amp;', ' At&T, " < I am > , At&T  so  &#60; &lt; &  & ')
' At&amp;T, " < I am > , At&amp;T  so  &#60; &lt; &amp;  &amp; '

You can use a negative lookahead assertion for \w+; (e.g. the nbsp; in &nbsp;) and #[0-9]+; (e.g. the #60; in &#60;).

Therefore the regex is:

&(?!(#[0-9]+;|\w+;)) where the negative lookahead assertion ensures that neither #[0-9]+; nor \w+; immediately follows the &

You could also use [a-zA-Z]+; instead of \w+; which is stricter, since \w also matches digits and underscores.
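If the input may also contain hexadecimal references (such as &#x3C;) or names with digits (such as &frac12;), the lookahead can be widened. The pattern below is my own assumption, not something from the question, and the function name escape_bare_ampersands is hypothetical. It only checks the shape of what follows the &, not whether the name is a real entity.

```python
import re

# Assumed pattern (an extension of the one above, not from the question):
# an "&" is left alone only when it starts
#   - a decimal reference:  &#60;
#   - a hex reference:      &#x3C;
#   - a name made of a letter followed by letters/digits: &lt; &frac12;
BARE_AMP = re.compile(
    r'&(?!#[0-9]+;|#[xX][0-9a-fA-F]+;|[a-zA-Z][a-zA-Z0-9]*;)'
)


def escape_bare_ampersands(text):
    """Replace every '&' that does not begin an entity-shaped token."""
    return BARE_AMP.sub('&amp;', text)


sample = 'At&T &#60; &#x3C; &lt; &frac12; &_x; & &'
print(escape_bare_ampersands(sample))
# prints: At&amp;T &#60; &#x3C; &lt; &frac12; &amp;_x; &amp; &amp;
```

Compared with \w+; this escapes the & in &_x; and compared with [a-zA-Z]+; it keeps &frac12; intact.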

riteshtch