Getting strange characters instead of Arabic letters when scraping an Arabic website

Question

I would like to scrape this site: http://waqfeya.com/book.php?bid=1

but when I do, I get characters like these: ÇáÞÑÂä ÇáßÑíã

This is what my script looks like:

import requests
from bs4 import BeautifulSoup
BASE_URL = "http://waqfeya.com/book.php?bid=1" 
source = requests.get(BASE_URL)
soup = BeautifulSoup(source.text, 'lxml') 
print(soup)

I tried these things, but they don't work for me:

source.encoding = 'utf-8'

and this:

source.encoding = 'ISO-8859-1'

also this:

soup = BeautifulSoup(source.text, from_encoding='ISO-8859-1')

But none worked for me.

Check out this answer: https://stackoverflow.com/a/2087433/8272698 - Julian Silvestri, Feb 12 '19 at 16:14

score 1 · Answer 1 · answered Feb 12 '19 at 16:20

Use urlopen instead of requests, so that BeautifulSoup receives the raw bytes and detects the encoding itself:

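# Python 3 import; in Python 2 this was `from urllib import urlopen`
from urllib.request import urlopen

from bs4 import BeautifulSoup

BASE_URL = "http://waqfeya.com/book.php?bid=1"
# Avoid naming the variable `open`, which would shadow the built-in
response = urlopen(BASE_URL)
# BeautifulSoup reads the undecoded bytes from the response object
soup = BeautifulSoup(response, 'lxml')
print(soup)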
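
The same idea should also work without leaving Requests: hand BeautifulSoup the undecoded bytes (source.content) rather than source.text. The sketch below is offline and rests on assumptions: the `raw` bytes stand in for a real response body, and `html.parser` is used only to avoid depending on lxml. Note that BeautifulSoup ignores `from_encoding` when the markup is already a decoded string, which is why it has to be given bytes.

```python
from bs4 import BeautifulSoup

# Stand-in for `requests.get(BASE_URL).content`: raw bytes in windows-1256.
# (Assumed sample markup, not copied from the site.)
raw = (
    "<html><head><title>"
    "\u0627\u0644\u0642\u0631\u0622\u0646 \u0627\u0644\u0643\u0631\u064a\u0645"
    "</title></head></html>"
).encode("windows-1256")

# Bytes in, so from_encoding is honoured and the text is decoded correctly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="windows-1256")
title = soup.title.string
print(title)
```

With a live response this would be BeautifulSoup(source.content, 'lxml', from_encoding='windows-1256').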

Omer Tekbiyik


score 1 · Answer 2 · answered Feb 12 '19 at 20:01

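Sometimes Requests gets the encoding wrong: it takes the encoding from the HTTP Content-Type header and, when a text response declares no charset there, falls back to ISO-8859-1. For this site we can get the correct encoding from the page source.

You can assign the encoding with source.encoding = 'windows-1256' before using source.text in BeautifulSoup.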
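
As a rough illustration of reading the encoding from the source, the sketch below searches the first bytes of a document for a charset declaration in a meta tag. `sniff_meta_charset` is a hypothetical helper, not part of Requests or BeautifulSoup, and the sample markup is an assumed stand-in rather than the site's actual HTML.

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE
)


def sniff_meta_charset(raw: bytes, limit: int = 2048) -> Optional[str]:
    """Return the charset named in a <meta> tag, or None if none is found."""
    match = _META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


# Stand-in for the bytes of a response body (assumed markup).
sample = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1256">'
    '<title>\u0627\u0644\u0642\u0631\u0622\u0646</title></head></html>'
).encode("windows-1256")

charset = sniff_meta_charset(sample)
print(charset)                 # the declared charset name
print(sample.decode(charset))  # the markup, decoded with that charset
```

With a live response, something like source.encoding = sniff_meta_charset(source.content) or source.apparent_encoding would apply the result before source.text is read.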

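import requests

BASE_URL = "http://waqfeya.com/book.php?bid=1"
source = requests.get(BASE_URL)
print(source.encoding)            # encoding Requests took from the HTTP headers
print(source.apparent_encoding)   # encoding guessed from the response body
source.encoding = 'windows-1256'  # override before reading source.text
print(source.text)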

I was able to get all the Arabic characters properly.
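
To see why the output in the question looks the way it does, the garbled string can be repaired offline. This is a minimal sketch, assuming the page bytes were windows-1256 and were decoded as ISO-8859-1: re-encoding with the wrong codec recovers the original bytes, and decoding those with the right codec recovers the Arabic text.

```python
# The garbled string from the question.
garbled = "ÇáÞÑÂä ÇáßÑíã"

# Undo the wrong decode: ISO-8859-1 maps every character back to one byte.
raw = garbled.encode("iso-8859-1")

# Decode the recovered bytes with the codec the page actually uses.
fixed = raw.decode("windows-1256")
print(fixed)  # القرآن الكريم
```

This is only a diagnostic. In the scraper itself, setting source.encoding before reading source.text avoids the bad decode in the first place.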
