
I am building a scraper and want to extract the contents of some tags exactly as they appear in the source, without any conversion. However, BeautifulSoup converts hex character references to ASCII. For example, the character references in this HTML get converted to ASCII:

html = """\
<title>&#x42;&#x69;&#x6C;&#x6C;&#x69;&#x6E;&#x67;&#x20;&#x61;&#x64;&#x64;&#x72;&#x65;&#x73;&#x73; - &#x50;&#x61;&#x79;&#x50;&#x61;&#x6C;</title>
<title>Billing address - PayPal</title>"""

Here's a small example of the code:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
for element in soup.find_all(['title', 'form', 'a']):
    print(str(element))

But I want to extract the data in its original form. I believe BeautifulSoup 4 automatically converts HTML entities, which is what I don't want. Any help would be really appreciated.

BTW, I am using Python 3.5 and BeautifulSoup 4.

Arjun Thakur
  • Possible duplicate of [Decode HTML entities in Python string?](https://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string) – vishal Jul 05 '18 at 12:56
  • BeautifulSoup 4 automatically converts HTML entities, and this is what I don't want. – Arjun Thakur Jul 06 '18 at 05:21

1 Answer


You might try the re module (regular expressions). For instance, the code below extracts the title tag without converting it (I assume you have already declared the html variable):

import re
result = re.search(r'<title>.*?</title>', html).group(0)
print(result) # It'll print <title>&#x42;&#x69;&#x6C;&#x6C;&#x69;&#x6E;&#x67;&#x20;&#x61;&#x64;&#x64;&#x72;&#x65;&#x73;&#x73; - &#x50;&#x61;&#x79;&#x50;&#x61;&#x6C;</title>

You can do the same for the other tags as well.
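As a rough sketch of how that could look for several tags at once, the helper below builds one pattern from a list of tag names and returns each element's raw source text. The function name extract_raw_tags and the sample markup are made up for illustration. It assumes simple, non-nested markup: a regex will not cope with a tag nested inside another tag of the same name.

```python
import re

# Sample input; the <a> element is an invented example.
html = '<title>&#x42;&#x69;</title>\n<a href="/pay">&#x50;ay</a>'

def extract_raw_tags(source, tag_names):
    """Return the raw text of each matching element, entities untouched."""
    names = "|".join(re.escape(name) for name in tag_names)
    # \1 refers back to the opening tag name so the closing tag matches it.
    pattern = re.compile(r"<(%s)\b[^>]*>.*?</\1\s*>" % names,
                         re.IGNORECASE | re.DOTALL)
    return [match.group(0) for match in pattern.finditer(source)]

for raw in extract_raw_tags(html, ["title", "form", "a"]):
    print(raw)
```

Because the text is never parsed, the character references are returned exactly as written in the source.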

Alireza
  • Thanks for the quick answer, but I want to achieve this with BeautifulSoup only, because there are multiple other tags I am fetching values from, and they might contain the same hex strings. – Arjun Thakur Jul 06 '18 at 05:24
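
One possible way to stay within BeautifulSoup, offered as a sketch rather than a definitive fix, is to escape every "&" before parsing so the parser sees the references as literal text, then serialize each tag with decode(formatter=None), which tells Beautiful Soup not to escape strings on output. This assumes every "&" in the input should be kept literally, and it uses the built-in "html.parser" instead of lxml to keep the example dependency-free. The sample markup is invented. Note that the output is Beautiful Soup's re-serialization of each tag, so details such as attribute quoting may differ from the source.

```python
from bs4 import BeautifulSoup

# Sample input; the <a> element is an invented example.
html = """\
<title>&#x42;&#x69;&#x6C;&#x6C; - &#x50;&#x61;&#x79;</title>
<a href="/pay">&#x50;ay</a>"""

# Assumption: no "&" in the input needs to be interpreted by the parser.
# After this step "&#x42;" becomes "&amp;#x42;", which parses to the
# literal text "&#x42;".
escaped = html.replace("&", "&amp;")

soup = BeautifulSoup(escaped, "html.parser")

# formatter=None writes strings out unmodified, so "&" is not re-escaped.
raw_tags = [element.decode(formatter=None)
            for element in soup.find_all(["title", "form", "a"])]

for raw in raw_tags:
    print(raw)
```

The blanket replace also touches any "&" inside script or style content, so it may need narrowing for pages that contain those.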