BeautifulSoup shows strange text

Question

I am trying to scrape data from a Bengali-language website. When I inspect an element on that website, everything displays as it should.

code:

request = requests.get("https://corona.gov.bd/")
soup = BeautifulSoup(request.content, "lxml")
print(soup.prettify())

Part of the output:

<strong>
  à¦¸à¦à¦°à¦¾à¦à¦° à¦à¦¿à¦à§à¦à¦¾à¦¸à¦¾
</strong>

à¦¸à¦à¦°à¦¾à¦à¦° à¦à¦¿à¦à§à¦à¦¾à¦¸à¦¾ >> should be >>"সচরাচর জিজ্ঞাসা"

I am not sure whether it is ASCII or not. I used https://onlineasciitools.com/convert-ascii-to-unicode to convert that text into Unicode. According to that website, it may be ASCII. But I checked an ASCII table online and none of those characters were in it. So how do I convert that text into something readable? Any help?

Use `request.text` instead. The content will be decoded for you, assuming the website declared the encoding correctly. — Mark Tolonen, Nov 22 '20 at 04:37

score 0 · Accepted Answer · answered Nov 21 '20 at 13:03


You should just decode the content, like this:

request.content.decode('utf-8')
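
To see why decoding helps, here is a small offline sketch that needs no network access. It assumes the garbled output came from UTF-8 bytes being read with a single-byte Western codec such as Latin-1. That is a guess consistent with the `à¦¸` pattern in the question, not something confirmed for this site. The names `raw`, `garbled` and `correct` are only illustrative.

```python
# The heading the question expected to see.
expected = "সচরাচর জিজ্ঞাসা"

# Stand-in for request.content: the same text as UTF-8 bytes.
raw = expected.encode("utf-8")

# Wrong codec: Latin-1 maps every single byte to one character,
# so each 3-byte Bengali code point turns into three unrelated symbols.
garbled = raw.decode("latin-1")

# Right codec: the bytes are read back as the original text.
correct = raw.decode("utf-8")

print(garbled[:3])  # the first Bengali letter, misread as three characters
print(correct)

# Under this assumption the damage is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")
```

If the page were not actually UTF-8, `decode('utf-8')` would raise `UnicodeDecodeError` rather than silently produce wrong text.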


sstevan


score 0 · Answer 2 · answered Nov 21 '20 at 17:02

Yes, it works. You need to call decode('utf-8') on the response content before passing it to BeautifulSoup.

import requests
from bs4 import BeautifulSoup
request = requests.get("https://corona.gov.bd/")

soup = BeautifulSoup(request.content.decode('utf-8'), "lxml")
my_data = soup.find('div', {'class':'col-md-6 col-sm-6 col-xs-12 slider-button-center xs-mb-15'})
print(my_data.get_text(strip=True, separator='|'))

print output:

স্বাস্থ্য বিষয়ক সেবা|(ডাক্তার, হাসপাতাল, ঔষধ, টেস্ট)|খাদ্য ও জরুরি সেবা|(খাদ্য, অ্যাম্বুলেন্স, ফায়ার সার্ভিস)|সচরাচর জিজ্ঞাসা|FAQ

Mark Tolonen · Answer 3 · 2020-11-22T04:39:46.573

The response object returned by requests.get() exposes both the raw bytes (request.content) and the text decoded from those bytes (request.text). The decoding uses the encoding the server declared in its HTTP headers.

request.encoding is the encoding requests will use for that decoding (which may not be UTF-8), and request.text is the already-decoded content.
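
If the server omits the charset or declares it wrongly, request.text can still come out garbled. Below is a rough sketch of one way to guard against that. `pick_encoding` and `fetch_text` are hypothetical helpers, not part of requests. The sketch relies on `response.encoding` (taken from the HTTP headers), on `response.apparent_encoding` (detected from the bytes), and on requests falling back to ISO-8859-1 for text responses that carry no charset. Treating every ISO-8859-1 value as "not declared" is an assumption that would be wrong for a site that really uses that encoding.

```python
def pick_encoding(declared, detected, fallback="utf-8"):
    """Hypothetical helper: choose which codec to trust for a response body."""
    # A missing charset shows up as None, or as the ISO-8859-1 default
    # that requests applies to text/* responses (assumption: see lead-in).
    if declared and declared.lower() != "iso-8859-1":
        return declared
    # Otherwise prefer the detected encoding, then a fixed fallback.
    return detected or fallback


def fetch_text(url):
    """Hypothetical helper: fetch a page and decode it with pick_encoding."""
    # Third-party import kept local so pick_encoding works without requests.
    import requests

    response = requests.get(url, timeout=10)
    # Setting response.encoding changes how response.text is decoded.
    response.encoding = pick_encoding(response.encoding, response.apparent_encoding)
    return response.text
```

The string returned by `fetch_text("https://corona.gov.bd/")` could then be passed to BeautifulSoup in place of request.text.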

Example using request.text instead:

import requests
from bs4 import BeautifulSoup

request = requests.get("https://corona.gov.bd/")
soup = BeautifulSoup(request.text, "lxml")
print(soup.find('title'))

<title>করোনা ভাইরাস ইনফো ২০১৯ | Coronavirus Disease 2019 (COVID-19) Information Bangladesh | corona.gov.bd</title>
