Persian text cannot be parsed correctly when crawling a Persian website

Question

I'm crawling some Persian/Farsi websites using the requests library in Python. When I use the get method, most of the websites respond correctly, but a few send back unreadable characters. This is an example of a response returned by the get method of the requests library:

In Persian (what I expected to receive): سیاستگذاری دولت در حوزه مسکن تغییر می کند؟
response: Ø³Û\x8cØ§Ø³ØªÚ¯Ø°Ø§Ø±Û\x8c Ø¯Ù\x88Ù\x84Øª Ø¯Ø± Ø\xadÙ\x88Ø²Ù\x87 Ù\x85Ø³Ú©Ù\x86 ØªØºÛ\x8cÛ\x8cØ± Ù\x85Û\x8c Ú©Ù\x86Ø¯Ø\x9f
crawled url: http://irban.ir/ShowNews/6833/%D8%B3%DB%8C%D8%A7%D8%B3%D8%AA%DA%AF%D8%B0%D8%A7%D8%B1%DB%8C-%D8%AF%D9%88%D9%84%D8%AA-%D8%AF%D8%B1-%D8%AD%D9%88%D8%B2%D9%87-%D9%85%D8%B3%DA%A9%D9%86-%D8%AA%D8%BA%DB%8C%DB%8C%D8%B1-%D9%85%DB%8C-%DA%A9%D9%86%D8%AF

And this is my code:

import scrapy
import requests
from requests.auth import HTTPBasicAuth

url = "http://irban.ir/ShowNews/6833/%D8%B3%DB%8C%D8%A7%D8%B3%D8%AA%DA%AF%D8%B0%D8%A7%D8%B1%DB%8C-%D8%AF%D9%88%D9%84%D8%AA-%D8%AF%D8%B1-%D8%AD%D9%88%D8%B2%D9%87-%D9%85%D8%B3%DA%A9%D9%86-%D8%AA%D8%BA%DB%8C%DB%8C%D8%B1-%D9%85%DB%8C-%DA%A9%D9%86%D8%AF"
response = requests.get(url, auth=HTTPBasicAuth('test', 'testpass'),
                            headers={
                                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'},
                            verify=False, timeout=60).text
selector = scrapy.Selector(text=response)
css_pattern = ".forTitle"
selected_value = selector.css(css_pattern).extract_first()
print(selected_value)
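
The garbled response above has the shape of UTF-8 bytes that were decoded as ISO-8859-1 (Latin-1): each Persian letter shows up as two characters, the first being Ø, Ù, Û, or Ú. The following is a minimal sketch of that hypothesis, not a confirmed diagnosis of this site. It assumes the server sends UTF-8 without declaring a charset in its Content-Type header, in which case requests falls back to ISO-8859-1 for text responses. The helper name repair_mojibake is hypothetical and used only for illustration.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as ISO-8859-1.

    Hypothetical helper: it only works if the wrong decoding really was
    Latin-1, which is an assumption about the response in the question.
    """
    return text.encode("latin-1").decode("utf-8")


expected = "سیاستگذاری دولت در حوزه مسکن تغییر می کند؟"

# Reproduce the problem locally: UTF-8 bytes read with the wrong codec.
garbled = expected.encode("utf-8").decode("latin-1")

# Reverse the wrong decoding to recover the original text.
repaired = repair_mojibake(garbled)
print(repaired == expected)

# With requests, the same idea is usually applied before reading .text
# (shown as comments so this sketch needs no network access):
#   response = requests.get(url, timeout=60)
#   response.encoding = "utf-8"   # or response.apparent_encoding
#   html = response.text
```

If the assumption holds, keeping the Response object in the question's code (instead of calling .text directly on the get call) and setting its encoding attribute to "utf-8" before reading .text should give readable Persian. Decoding response.content manually with "utf-8" is an equivalent alternative.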
