I am scraping a website whose page URLs are in Nepali, i.e. written in a non-Latin script. How do I specify start_urls for a spider (I am using Scrapy)? Is there an encoding technique for this? And does copying and pasting the URLs directly from the browser have a chance of working?

Update: I also need to follow the links that I find on certain pages, and of course those links are non-English as well. Thank you.

Nabin
  • Which version of Python? Have you actually tried to copy and paste the URLs? – jonrsharpe May 20 '14 at 10:20
  • Version 2.7. Yes, I have tried copy and paste and it doesn't seem to work, but I am not sure. I have updated my question as well. Thank you. – Nabin May 20 '14 at 10:22
  • URLs are encoded in UTF-8; see [Url decode UTF-8 in Python](http://stackoverflow.com/q/16566069) for example. Your Nepali URLs will be no different. – Martijn Pieters May 20 '14 at 10:25
  • For me, it's almost always @MartijnPieters. :-) Thank you. I will look into it. – Nabin May 20 '14 at 10:26
  • Now when assigning the links obtained to item['link'], I get the following: 'Request' object does not support item assignment. Does it have anything to do with the non-English URL? @jonrsharpe – Nabin May 20 '14 at 11:09
  • @Nabin nope, nothing to do with that. – jonrsharpe May 20 '14 at 11:12

1 Answer

URLs that conform to RFC 3986 will be encoded using UTF-8 and URL Percent Encoding. Nepali uses the Devanagari script, which is perfectly representable in Unicode and thus can be encoded in UTF-8.
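To make the two steps concrete, here is a minimal sketch for a single Devanagari letter. It assumes Python 3 (the question uses 2.7, where `quote` lives in `urllib` and expects an already-encoded byte string); the variable names are only illustrative.

```python
from urllib.parse import quote

ch = "\u092e"  # DEVANAGARI LETTER MA (म)

# Step 1: the character becomes three bytes in UTF-8.
utf8_bytes = ch.encode("utf-8")

# Step 2: each of those bytes is written as a %XX escape.
# In Python 3, quote() encodes a str as UTF-8 by default.
escaped = quote(ch)

print(utf8_bytes)
print(escaped)
```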

Take a look at the Nepali Wikipedia for examples. That specific URL is a good example of the UTF-8 and URL percent encoding:

http://ne.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0

The series of %E0%A4%AE escapes are percent-encoded UTF-8 bytes. The HTML source code of the page should have these URLs already encoded, but if they look like this instead:

http://ne.wikipedia.org/wiki/मुख्य_पृष्ठ

you can encode the path portion yourself with:

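import urlparse, urllib

# Python 2: split the URL so that only the path component is re-encoded
parts = urlparse.urlsplit(u'http://ne.wikipedia.org/wiki/मुख्य_पृष्ठ')
# encode the path to UTF-8 bytes, then percent-encode those bytes
parts = parts._replace(path=urllib.quote(parts.path.encode('utf8')))
# the reassembled URL now contains only ASCII characters
encoded_url = parts.geturl().encode('ascii')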

Demo:

>>> import urlparse, urllib
>>> parts = urlparse.urlsplit(u'http://ne.wikipedia.org/wiki/मुख्य_पृष्ठ')
>>> parts = parts._replace(path=urllib.quote(parts.path.encode('utf8')))
>>> parts.geturl().encode('ascii')
'http://ne.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0'
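On Python 3 the same functions live in `urllib.parse`. The following is a rough sketch of the same approach under that assumption, extended to links extracted from a page, which are often relative. `encode_url` and `absolute_encoded` are hypothetical helper names (not part of Scrapy or the standard library), and `/wiki/नेपाल` is a made-up link used only for illustration. Only the path is handled; a non-ASCII host name or query string would need separate treatment.

```python
from urllib.parse import quote, urljoin, urlsplit


def encode_url(url):
    """Percent-encode the path of `url` as UTF-8 (hypothetical helper)."""
    parts = urlsplit(url)
    # safe="/%" keeps path separators and leaves existing %XX escapes alone,
    # so an already-encoded URL passes through unchanged. The trade-off is
    # that a literal "%" in the path is not escaped either.
    parts = parts._replace(path=quote(parts.path, safe="/%"))
    return parts.geturl()


def absolute_encoded(page_url, href):
    """Resolve a link found on `page_url`, then encode it (hypothetical helper)."""
    return encode_url(urljoin(page_url, href))


start = encode_url("http://ne.wikipedia.org/wiki/मुख्य_पृष्ठ")
print(start)
print(absolute_encoded(start, "/wiki/नेपाल"))
```

The returned strings contain only ASCII characters, so they should be usable as entries in a spider's start_urls or as the URL of a follow-up request, although that has not been tested against Scrapy here.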
Martijn Pieters
  • Now when assigning the links obtained to item['link'], I get the following: 'Request' object does not support item assignment. Does it have anything to do with the non-English URL? – Nabin May 20 '14 at 11:06
  • @Nabin: sorry, I don't know how scrapy internals work. That's an entirely separate issue, however. If there are no duplicates on Stack Overflow for that issue, feel free to ask a new question about that issue. – Martijn Pieters May 20 '14 at 11:15
  • @Nabin - as per Martijn's suggestion, I would recommend creating a new question and adding the scrapy tag. Please include your full spider code and the URL of the site that you are trying to scrape to aid debugging of the issue. :) – Talvalin May 20 '14 at 13:55