Decoding ampersand hash strings (|xa)etc

Question

The solutions in other answers do not work when I try them: those methods output the same string unchanged.

I am trying to do web scraping using Python 2.7. I have downloaded the webpage, and it contains characters in the form &#120, where 120 seems to be the ASCII code. I tried using HTMLParser() and the decode() methods, but nothing seems to work. Please note that the only text I have from the webpage is in this format. Example:

&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32

Please guide me to decode these strings using Python. I have read the other answers but the solutions don't seem to work for me.

They aren't valid [character references](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_reference_overview): they are missing the terminating semicolon. But once those semicolons are added, that example decodes to `Blasterjaxx ` – PM 2Ring, Jul 20 '16 at 11:45
Try to specify the encoding explicitly when downloading those pages – frist, Jul 20 '16 at 12:03

score 6 · Answer 1 · answered Jul 20 '16 at 12:30

The correct format for a numeric character reference is &#nnnn; so the terminating ; is missing in your example. You can add the ; and then use HTMLParser.unescape():

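from HTMLParser import HTMLParser  # Python 2 module
import re

x = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
# Append the missing ';' after each run of digits that follows '&#'.
x = re.sub(r'(&#[0-9]+)', r'\1;', x)
print x
# unescape() converts the now-valid character references to text.
h = HTMLParser()
print h.unescape(x)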

This gives the following output:

&#66;&#108;&#97;&#115;&#116;&#101;&#114;&#106;&#97;&#120;&#120;&#32;
Blasterjaxx
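One caveat with the substitution above: if some references in the scraped text already end in a semicolon, the pattern appends a second one. Here is a minimal sketch of a variant that avoids this, written for Python 3 and assuming the input only contains decimal references. The helper name add_missing_semicolons is made up for illustration.

```python
import html
import re


def add_missing_semicolons(text):
    # Hypothetical helper, not part of any library.
    # '[0-9]+' requires at least one digit, and the optional ';?' consumes an
    # existing semicolon so the replacement never produces ';;'.
    return re.sub(r'&#([0-9]+);?', r'&#\1;', text)


raw = '&#66&#108&#97;&#115'          # mixed: some references already terminated
fixed = add_missing_semicolons(raw)
print(fixed)                          # &#66;&#108;&#97;&#115;
print(html.unescape(fixed))           # Blas
```

Running the helper twice on the same string gives the same result, so it should be safe to apply to text that is already well formed.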

Fabich

This solution is deprecated; please refer to the solution presented below (add semicolons first): https://stackoverflow.com/a/55985595/318618 – LucasBr Nov 19 '21 at 16:02

score 4 · Answer 2 · answered May 04 '19 at 18:23

In Python 3, use the html module:

>>> import html
>>> html.unescape('&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32')
'Blasterjaxx '

docs: https://docs.python.org/3/library/html.html
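To illustrate why no preprocessing is needed here: according to the linked docs, html.unescape (Python 3.4+) follows the HTML 5 rules for both valid and invalid character references, so it accepts decimal references with or without the semicolon, as well as hex and named ones. A small sketch, with sample strings made up for the demonstration:

```python
import html

# Made-up sample inputs covering the common reference styles.
samples = [
    '&#66&#108&#97',         # decimal, semicolons missing (as in the question)
    '&#66;&#108;&#97;',      # decimal, well formed
    '&#x42;&#x6C;&#x61;',    # hexadecimal
    'Tom &amp; Jerry',       # named reference inside ordinary text
]

decoded = [html.unescape(s) for s in samples]
for original, result in zip(samples, decoded):
    print(repr(original), '->', repr(result))
```

The first three samples should all decode to 'Bla', and the last one to 'Tom & Jerry'.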

frnhr

score 3 · Accepted Answer · answered Jul 20 '16 at 13:11

Depending on what you're doing, you may wish to convert that data to valid HTML character references so you can parse it in context with a proper HTML parser.

However, it's easy enough to extract the number strings and convert them to the equivalent ASCII characters yourself. For example:

s = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
# Split on '&#', then turn each remaining number string into its character.
print ''.join([chr(int(u)) for u in s.split('&#') if u])

output

Blasterjaxx

The if u skips over the initial empty string that we get because s begins with the splitting string '&#'. Alternatively, we could skip it by slicing:

''.join([chr(int(u)) for u in s.split('&#')[1:]])
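The split approach assumes the string contains nothing but references. If the references are mixed with ordinary text, a regex substitution may be easier. Below is a rough sketch for Python 3, where chr() covers all of Unicode (on Python 2.7 you would likely need unichr() instead, since chr() there stops at 255). The function name decode_numeric_refs is hypothetical.

```python
import re


def decode_numeric_refs(text):
    # Hypothetical helper: replace each decimal reference, with or without a
    # trailing ';', by the character for that code point. Everything else in
    # the string is left untouched.
    return re.sub(r'&#([0-9]+);?', lambda m: chr(int(m.group(1))), text)


s = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
print(repr(decode_numeric_refs(s)))                      # 'Blasterjaxx '
print(repr(decode_numeric_refs('Artist: &#66&#108;!')))  # 'Artist: Bl!'
```

Note that chr() raises ValueError for numbers above 0x10FFFF, so untrusted input may need a guard around the conversion.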
