1

I would like to scrape this site: http://waqfeya.com/book.php?bid=1

But when I do, I get characters like these: ÇáÞÑÂä ÇáßÑíã

This is what my script looks like:

import requests
from bs4 import BeautifulSoup
BASE_URL = "http://waqfeya.com/book.php?bid=1" 
source = requests.get(BASE_URL)
soup = BeautifulSoup(source.text, 'lxml') 
print(soup)

I tried the following, but none of it worked:

source.encoding = 'utf-8'

and this:

source.encoding = 'ISO-8859-1'

also this:

soup = BeautifulSoup(source.text, from_encoding='ISO-8859-1')

But none of them worked for me.

halfer
Oussama He

2 Answers

1

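Use urlopen instead of requests, so that BeautifulSoup receives the raw bytes and can detect the encoding itself:

from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen

BASE_URL = "http://waqfeya.com/book.php?bid=1"
# Avoid naming the response `open`, which would shadow the built-in
response = urlopen(BASE_URL)
soup = BeautifulSoup(response, 'lxml')
print(soup)  # on Python 2, print(soup.encode('utf-8')) instead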
Omer Tekbiyik
1

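Requests sometimes guesses the encoding wrong. For example, when the Content-Type header is a text type with no charset, it falls back to ISO-8859-1. For this site, the correct encoding is declared in the page source.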
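
To see why the wrong guess produces exactly the characters in the question, here is a small offline sketch. It assumes the page bytes are windows-1256 (the charset this answer arrives at) and that they were decoded as ISO-8859-1 (latin-1); the variable names are only illustrative.

```python
# The garbled string copied from the question.
garbled = "ÇáÞÑÂä ÇáßÑíã"

# Undo the wrong decode: latin-1 maps every character back to its original byte.
raw = garbled.encode("latin-1")

# Decode those same bytes with the charset the page actually uses (assumed here).
repaired = raw.decode("windows-1256")

print(repaired)
```

The bytes were never damaged; only the decoding step was wrong, which is why changing the encoding before reading the text is enough.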


You can assign the encoding like source.encoding='windows-1256' before using source.text in BeautifulSoup.

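import requests

BASE_URL = "http://waqfeya.com/book.php?bid=1"
source = requests.get(BASE_URL)
print(source.encoding)           # encoding Requests inferred from the HTTP headers
print(source.apparent_encoding)  # encoding guessed from the response body
source.encoding = 'windows-1256'  # override before reading source.text
print(source.text)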

I was able to get all the Arabic characters properly.
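
If you would rather not hard-code the charset, one option is to read it from the page's own meta declaration. The following is a rough sketch using only the standard library. `sniff_meta_charset` is a hypothetical helper, not part of Requests or BeautifulSoup, and the sample bytes are made up to stand in for `source.content`.

```python
import re

# Hypothetical helper: find a charset declaration near the top of an HTML page.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)

def sniff_meta_charset(raw, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    match = _META_CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii") if match else default

# Made-up sample bytes, encoded the way the real page is assumed to be.
sample = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=windows-1256"></head>'
    '<body>\u0627\u0644\u0642\u0631\u0622\u0646</body></html>'
).encode("windows-1256")

charset = sniff_meta_charset(sample)
text = sample.decode(charset)
print(charset)
```

With Requests this would be used as `source.encoding = sniff_meta_charset(source.content)` before reading `source.text`. A regex like this is only a heuristic; a declared charset can itself be wrong or missing, so the fallback default is a guess.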

Bitto