Python Beautifulsoup extract hexadecimal values

Question

I am building a scraper and want to extract the contents of some tags exactly as they appear in the source, without any conversion. However, BeautifulSoup converts hexadecimal character references into plain characters. For example, the first title below gets converted to ASCII text:

html = """\
<title>&#x42;&#x69;&#x6C;&#x6C;&#x69;&#x6E;&#x67;&#x20;&#x61;&#x64;&#x64;&#x72;&#x65;&#x73;&#x73; - &#x50;&#x61;&#x79;&#x50;&#x61;&#x6C;</title>
<title>Billing address - PayPal</title>"""

Here is a small example of the code:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
for element in soup.find_all(['title', 'form', 'a']):
    print(str(element))

But I want to extract the data in its original form. I believe BeautifulSoup 4 automatically converts HTML entities, which is what I don't want. Any help would be really appreciated.

BTW, I am using Python 3.5 and BeautifulSoup 4.

Possible duplicate of [Decode HTML entities in Python string?](https://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string) — vishal, Jul 05 '18 at 12:56
BeautifulSoup 4 automatically converts HTML entities, and this is what I don't want. — Arjun Thakur, Jul 06 '18 at 05:21

score 1 · Answer 1 · answered Jul 05 '18 at 13:21

You might try the re module (regular expressions). For instance, the code below extracts the title tag without converting it (I assume you have already declared the html variable):

import re
# A raw string avoids invalid escape sequences; .*? stops at the first closing tag.
result = re.search(r'<title>.*?</title>', html).group(0)
print(result) # It'll print <title>&#x42;&#x69;&#x6C;&#x6C;&#x69;&#x6E;&#x67;&#x20;&#x61;&#x64;&#x64;&#x72;&#x65;&#x73;&#x73; - &#x50;&#x61;&#x79;&#x50;&#x61;&#x6C;</title>

You may do the same for the other tags as well
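
To illustrate "the same for the other tags", here is a minimal sketch that generalizes the regex to the three tag names from the question. The helper name `extract_raw` is made up for this example, and the pattern assumes the elements are not nested inside an element of the same name; regexes are fragile on messy real-world HTML, so treat this as a starting point only.

```python
import re

html = """\
<title>&#x42;&#x69;&#x6C;&#x6C;&#x69;&#x6E;&#x67;&#x20;&#x61;&#x64;&#x64;&#x72;&#x65;&#x73;&#x73; - &#x50;&#x61;&#x79;&#x50;&#x61;&#x6C;</title>
<title>Billing address - PayPal</title>"""


def extract_raw(markup, names):
    """Return the untouched source text of every <name>...</name> element.

    Hypothetical helper: assumes no element is nested inside another
    element with the same tag name.
    """
    alternation = "|".join(re.escape(name) for name in names)
    # \b keeps "a" from matching "<address>"; \1 requires the matching closing tag.
    pattern = re.compile(
        r"<(%s)\b[^>]*>.*?</\1\s*>" % alternation,
        re.DOTALL | re.IGNORECASE,
    )
    return [match.group(0) for match in pattern.finditer(markup)]


matches = extract_raw(html, ["title", "form", "a"])
for raw in matches:
    print(raw)
```

Because the matches are plain slices of the original string, the hex character references come back exactly as written.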

Alireza

Thanks for the quick answer, but I want to achieve this with BeautifulSoup only, because I am fetching values from several other tags that might contain the same hex strings. – Arjun Thakur Jul 06 '18 at 05:24
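
One possible way to stay within BeautifulSoup, offered as a sketch rather than an official recipe: escape every `&` before parsing so the parser stores the references as literal text, then serialize with `formatter=None`, which the BeautifulSoup documentation describes as leaving strings unmodified on output. The example uses the built-in `html.parser` instead of `lxml` only to avoid an extra dependency, and the variable names `protected` and `raw_elements` are made up here.

```python
from bs4 import BeautifulSoup

html = """\
<title>&#x42;&#x69;&#x6C;&#x6C;&#x69;&#x6E;&#x67;&#x20;&#x61;&#x64;&#x64;&#x72;&#x65;&#x73;&#x73; - &#x50;&#x61;&#x79;&#x50;&#x61;&#x6C;</title>
<title>Billing address - PayPal</title>"""

# Turn "&#x42;" into "&amp;#x42;" so the parser keeps "&#x42;" as literal text
# instead of decoding it to "B".
protected = html.replace("&", "&amp;")
soup = BeautifulSoup(protected, "html.parser")

# formatter=None writes strings back without re-escaping the ampersands,
# which restores the original character references.
raw_elements = [
    element.decode(formatter=None)
    for element in soup.find_all(["title", "form", "a"])
]
for raw in raw_elements:
    print(raw)
```

With this approach `.string` and `.get_text()` also return the literal `&#x42;...` text. One caveat to check against your own pages: the contents of `script` and `style` elements are not entity-decoded by the parser, so ampersands inside them would stay as `&amp;` after this round trip.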
