I am trying to scrape data from a Bengali-language website. When I inspect an element on the site in the browser, everything looks as it should.

Code:

import requests
from bs4 import BeautifulSoup

request = requests.get("https://corona.gov.bd/")
soup = BeautifulSoup(request.content, "lxml")
print(soup.prettify())

Part of the output:

<strong>
  সà¦à¦°à¦¾à¦à¦° à¦à¦¿à¦à§à¦à¦¾à¦¸à¦¾
</strong>

সà¦à¦°à¦¾à¦à¦° à¦à¦¿à¦à§à¦à¦¾à¦¸à¦¾ >> should be >>"সচরাচর জিজ্ঞাসা"

I am not sure whether this is ASCII. I used https://onlineasciitools.com/convert-ascii-to-unicode to convert the text to Unicode, and according to that site it may be ASCII. But I checked an ASCII table online and none of these characters were in it. How can I convert this text into something readable?

The Golden

3 Answers

You should just decode the content, like this:

request.content.decode('utf-8')
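
To see why the decoding step matters, here is a minimal offline sketch (no network access). Latin-1 is an assumption about how the bytes were misread, chosen because it reproduces the "à¦" pattern visible in the question's output. It shows UTF-8 bytes being misdecoded into mojibake and then decoded correctly:

```python
# Phrase the page should display, and the UTF-8 bytes a server would send.
expected = "সচরাচর জিজ্ঞাসা"
raw = expected.encode("utf-8")

# Misreading UTF-8 bytes with a single-byte codec produces mojibake.
# Assumption: Latin-1 stands in for whatever codec was wrongly applied.
garbled = raw.decode("latin-1")

# Decoding with the correct codec recovers the original text.
fixed = raw.decode("utf-8")

# Text that is already garbled this way can be repaired by undoing
# the wrong decode and redoing it with the right codec.
repaired = garbled.encode("latin-1").decode("utf-8")

print(fixed == expected)     # True
print(repaired == expected)  # True
```

The repair step only works when the wrong decode was lossless, as it is with Latin-1; if bytes were dropped or replaced along the way, the original text cannot be recovered this way.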
sstevan

Yes, that works. You need to decode the response content with decode('utf-8') before passing it to BeautifulSoup.

import requests
from bs4 import BeautifulSoup
request = requests.get("https://corona.gov.bd/")

soup = BeautifulSoup(request.content.decode('utf-8'), "lxml")
my_data = soup.find('div', {'class':'col-md-6 col-sm-6 col-xs-12 slider-button-center xs-mb-15'})
print(my_data.get_text(strip=True, separator='|'))

Output:

স্বাস্থ্য বিষয়ক সেবা|(ডাক্তার, হাসপাতাল, ঔষধ, টেস্ট)|খাদ্য ও জরুরি সেবা|(খাদ্য, অ্যাম্বুলেন্স, ফায়ার সার্ভিস)|সচরাচর জিজ্ঞাসা|FAQ
Samsul Islam

The response object returned by requests.get() provides both the raw byte content (request.content) and the content decoded to text with the encoding that requests inferred for the response.

request.encoding is that inferred encoding (which may not be UTF-8), and request.text is the already-decoded content.

Example using request.text instead:

import requests
from bs4 import BeautifulSoup

request = requests.get("https://corona.gov.bd/")
soup = BeautifulSoup(request.text, "lxml")
print(soup.find('title'))

Output:

<title>করোনা ভাইরাস ইনফো ২০১৯ | Coronavirus Disease 2019 (COVID-19) Information Bangladesh | corona.gov.bd</title>
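
If request.text still comes out garbled, request.encoding is probably wrong for the page. With a real response you can assign request.encoding = 'utf-8' before reading request.text. The sketch below illustrates the same idea offline. Here decode_body is a hypothetical helper and the SimpleNamespace object is a stand-in for a real response. Treating a missing or ISO-8859-1 encoding as unreliable is an assumption (requests falls back to ISO-8859-1 for text responses whose headers declare no charset), not something confirmed for this site:

```python
from types import SimpleNamespace


def decode_body(response, fallback="utf-8"):
    """Decode response.content, distrusting a missing or default encoding.

    Hypothetical helper: `response` only needs `.content` (bytes) and
    `.encoding` (str or None), like a requests response.
    """
    declared = (response.encoding or "").lower()
    # Assumption: an absent or ISO-8859-1 encoding is likely a default,
    # not a real declaration, so prefer the fallback codec.
    if declared in ("", "iso-8859-1"):
        declared = fallback
    try:
        return response.content.decode(declared, errors="replace")
    except LookupError:
        # Unknown codec name in the headers: use the fallback as well.
        return response.content.decode(fallback, errors="replace")


# Stand-in for a response whose headers carried no charset.
fake = SimpleNamespace(
    encoding="ISO-8859-1",
    content="সচরাচর জিজ্ঞাসা".encode("utf-8"),
)
text = decode_body(fake)
print(text == "সচরাচর জিজ্ঞাসা")  # True
```

A page that really is served as Latin-1 would be misread by this helper, so it only makes sense when you already expect UTF-8, as with this Bengali site.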
Mark Tolonen