
The solutions in other answers do not work for me: those methods output the same string unchanged.

I am trying to do web scraping using Python 2.7. I have downloaded the webpage, and it contains characters in the form &#120, where 120 seems to be the ASCII code. I tried HTMLParser() and the decode() method, but neither works. Note that the strings I get from the webpage consist only of such references. Example:

&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32

Please guide me to decode these strings using Python. I have read the other answers but the solutions don't seem to work for me.

Ivankovich
  • They aren't valid [character references](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_reference_overview): they are missing the terminating semicolon. But once those semicolons are added, that example decodes to `Blasterjaxx ` – PM 2Ring Jul 20 '16 at 11:45
  • Try specifying the encoding explicitly when you download those pages – frist Jul 20 '16 at 12:03

3 Answers


The correct format for a numeric character reference is &#nnnn; so the terminating ; is missing in your example. You can add the ; and then use HTMLParser.unescape():

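from HTMLParser import HTMLParser
import re

x = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
# Append the missing ';' after each numeric reference (&#NNN -> &#NNN;)
x = re.sub(r'(&#[0-9]+)', r'\1;', x)
print x
# unescape() turns the now-valid references into characters
h = HTMLParser()
print h.unescape(x)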

This gives the following output:

&#66;&#108;&#97;&#115;&#116;&#101;&#114;&#106;&#97;&#120;&#120;&#32;
Blasterjaxx 
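
If some of the scraped references might already end in a semicolon, a plain substitution would double it. Below is a minimal sketch that guards against that and imports unescape in a way that should work on both Python 2 and Python 3. The helper name add_semicolons is hypothetical, not part of any library.

```python
import re

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 fallback
    unescape = HTMLParser().unescape


def add_semicolons(text):
    """Hypothetical helper: terminate numeric references that lack a ';'.

    The lookahead (?![0-9;]) only accepts a match when the whole run of
    digits has been consumed and no ';' follows, so references that are
    already terminated are left alone.
    """
    return re.sub(r'(&#[0-9]+)(?![0-9;])', r'\1;', text)


raw = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
fixed = add_semicolons(raw)
print(fixed)
print(unescape(fixed))
```

Calling add_semicolons on a string that already has its semicolons returns it unchanged, so it should be safe to run on mixed input.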
Fabich
  • This solution is deprecated, please refer to solution presented below (add semicolons first): https://stackoverflow.com/a/55985595/318618 – LucasBr Nov 19 '21 at 16:02

In Python 3, use the html module:

>>> import html
>>> html.unescape('&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32')
'Blasterjaxx '

docs: https://docs.python.org/3/library/html.html
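
As a quick illustration of why no semicolon fix-up is needed here: html.unescape follows the HTML5 rules, which tolerate a missing semicolon on numeric references. The sample string below is made up for the demonstration and mixes named, unterminated decimal, terminated decimal and hexadecimal references.

```python
import html

# Made-up sample: named (&amp;), unterminated decimal (&#66),
# terminated decimal (&#108;) and hexadecimal (&#x61;) references.
mixed = 'Tom &amp; Jerry: &#66&#108;&#x61;st'
decoded = html.unescape(mixed)
print(decoded)
```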

frnhr

Depending on what you're doing, you may wish to convert that data to valid HTML character references so you can parse it in context with a proper HTML parser.

However, it's easy enough to extract the number strings and convert them to the equivalent ASCII characters yourself. For example:

s ='&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
print ''.join([chr(int(u)) for u in s.split('&#') if u])

output

Blasterjaxx 

The `if u` skips over the initial empty string that we get because `s` begins with the splitting string `'&#'`. Alternatively, we could skip it by slicing:

''.join([chr(int(u)) for u in s.split('&#')[1:]])
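
The split-based approach assumes the string contains nothing but references. If they are embedded in other text, one option is a regex substitution. This is a rough Python 3 sketch, and decode_numeric_refs is a hypothetical name. In Python 3, chr accepts any Unicode code point, whereas Python 2's chr is limited to 0-255 and would need unichr beyond that.

```python
import re


def decode_numeric_refs(text):
    """Hypothetical helper: replace each decimal reference with its character.

    Matches "&#" followed by digits and an optional ';', and leaves all
    other text untouched.
    """
    return re.sub(r'&#([0-9]+);?', lambda m: chr(int(m.group(1))), text)


s = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
print(decode_numeric_refs(s))
print(decode_numeric_refs('Artist: &#66&#108; live'))
```

This only covers decimal references. Hexadecimal (&#x...) and named references are not handled, so a full HTML parser is still the safer choice for general input.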
PM 2Ring