I need help encoding a non-ASCII URL into a form that urlopen() accepts. This is my code for scraping a (non-ASCII) URL from a page and following it to the next page:

from urllib.request import urlopen
from bs4 import BeautifulSoup

Entrance URL, copy-pasted from the Chrome browser:

url = 'https://www.sheypoor.com/%DA%A9%D9%85%D8%AF-%D9%86%D9%88%D8%AC%D9%88%D8%A7%D9%86-34926671.html'
for i in range(1,10):
    html = urlopen(url)
    page = BeautifulSoup(html.read(), 'html.parser')
    url_obj = page.findAll('a')[13]['href'].strip()
    print(url_obj)
    url = url_obj

But I got an error:

'ascii' codec can't encode characters in position 5-9: ordinal not in range(128)

The UnicodeEncodeError traceback points at this line:

----> 8     html = urlopen(url)

In the first iteration of the loop, urlopen() works with the entrance URL, because it is already percent-encoded:

https://www.sheypoor.com/%DA%A9%D9%85%D8%AF-%D9%86%D9%88%D8%AC%D9%88%D8%A7%D9%86-34926671.html

The problem starts in the next iteration, when url_obj, scraped from the BeautifulSoup object, replaces the previous URL. It has this form:

https://www.sheypoor.com/سرویس-تخت-کمد-نوجوان-44887762.html

This form contains raw non-ASCII characters and cannot be passed to urlopen() as it is.

I tried to find a way to convert url_obj to the same percent-encoded form as the entrance URL, but I failed.

I would be grateful for any guidance on solving this problem.

Shafizadeh
  • Can you post the code you used to attempt to convert the `url_obj` to a valid URL? URLs are only allowed to have characters from a limited character set, and it appears that you're pulling the `href` value as a string, which will be unicode. You'll need to convert that unicode to a valid URL using something like `urllib.quote()`. – gaige Apr 27 '18 at 11:20
  • First approach: `url_obj.encode("UTF-8")`, which replaces the Farsi characters in the URL with "\xd8\xb3\xd8\xb1\xd9\x88\xdb\x8c\xd8\xb3-\xd8\xaa\xd8\xae\xd8\xaa-\xda\xa9\xd9\x85\xd8\xaf-\xd9\x86\xd9\x88\xd8\xac\xd9\x88\xd8\xa7\xd9\x86" – Homayoun Soleimani Apr 27 '18 at 20:37
  • For the second approach I used `from urllib.parse import unquote` and then `unquote(url_obj)`. – Homayoun Soleimani Apr 27 '18 at 20:54
  • You still need to use `urllib.quote` after you've UTF-8 encoded the string. `.encode('UTF-8')` will take the internal representation and make it UTF-8, but that's not sufficient for a URL, it also requires the URL Quoting (to give you the %). So, likely you need `new_url = urllib.quote(string.encode('UTF-8'))` – gaige Apr 28 '18 at 00:32
  • You can use the link below: https://stackoverflow.com/questions/11818362/how-to-deal-with-unicode-string-in-url-in-python3 – Mesbah Ahmadi Sep 28 '19 at 06:24
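
The difference between the approaches discussed in the comments can be checked with a short sketch. It assumes Python 3, where the quoting functions live in `urllib.parse` (there is no `urllib.quote` in Python 3), and uses the first Persian word of the entrance URL as sample input; the variable names are only illustrative.

```python
from urllib.parse import quote, unquote

# Sample input (an assumption for illustration): the first Persian word of the
# entrance URL, written with escapes so the code points are unambiguous.
word = '\u06a9\u0645\u062f'

# .encode() only produces raw UTF-8 bytes; they are still not URL-safe.
utf8_bytes = word.encode('utf-8')

# quote() UTF-8 encodes the text and then percent-escapes each byte.
quoted = quote(word)

# unquote() goes the other way: percent-escapes back to readable text.
restored = unquote(quoted)

print(utf8_bytes)
print(quoted)
print(restored == word)
```

So `unquote()` moves away from the form urlopen() needs, `encode()` stops halfway, and `quote()` yields the `%XX` form seen in the entrance URL.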

1 Answer

You could use something like this:

from urllib.request import urlopen
from urllib.parse import quote
persian_url = 'https://www.isna.ir/news/99010100077/' + quote('حواشی-در-آکروباتیک-ژیمناستیک-بالا-گرفت-دبیر-هم-استعفا-کرد')
page = urlopen(persian_url)

The original URL was: 'https://www.isna.ir/news/99010100077/حواشی-در-آکروباتیک-ژیمناستیک-بالا-گرفت-دبیر-هم-استعفا-کرد'
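
When the whole URL comes from a scraped `href`, as in the question, the path cannot simply be concatenated by hand. One possible way to handle that case, as a rough sketch rather than part of the original answer, is to split the URL, quote only the parts that may hold non-ASCII text, and join it again. The helper name `to_ascii_url` is hypothetical, and the sketch assumes the host name itself is ASCII (internationalized domain names would need separate handling).

```python
from urllib.parse import urlsplit, urlunsplit, quote


def to_ascii_url(url):
    """Percent-encode the non-ASCII parts of a full URL (hypothetical helper)."""
    parts = urlsplit(url)
    # Keeping '%' in `safe` leaves already-encoded URLs unchanged,
    # so the function can be applied to every scraped link.
    path = quote(parts.path, safe='/%')
    query = quote(parts.query, safe='=&%')
    fragment = quote(parts.fragment, safe='%')
    # Assumption: scheme and host are plain ASCII and are passed through as-is.
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


raw = 'https://www.sheypoor.com/\u06a9\u0645\u062f-34926671.html'
encoded = to_ascii_url(raw)
print(encoded)
```

In the question's loop this would be used as `url = to_ascii_url(url_obj)` before the next call to urlopen().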

narsan