Keep non-Latin characters when scraping page in python

Question

I have a program that scrapes a page, parses it for any links, then downloads the pages linked to (sounds like a crawler, but it's not) and saves each one in a separate file. The file name used to save is part of the url of the page. So for instance, if I find a link to www.foobar.com/foo, I would download the page and save it in a file entitled foo.xml.

Later, I need to loop through all such files and re-download them, using the file name as the last part of the url. (All pages are from a single site.)

It works well, until I encounter a non-Latin character in a url. The site uses utf-8, so when I download the original page and decode it, it works fine. But when I try to use the decoded url to download the corresponding page, it doesn't work, because, I assume, the encoding is wrong. I've tried using .encode() on the filename to change it back, but it doesn't change anything.

I know this must be very simple and a result of my not understanding encoding issues properly, but I've been cracking my head on it for a long time. I've read Joel Spolsky's introduction to encoding several times, but I still can't quite work out what to do here. Can anyone help me?

Thanks a lot, bsg

Here's some code. I don't get any errors; but when I try to download the page using the pagename as part of the url, I get told that that page doesn't exist. Of course it doesn't - there's no such page as abc/x54.

To clarify: I download the HTML of a page that includes a link to, e.g., www.foobar.com/Mehmet Kenan Dalbaşar, but the link shows up as Mehmet_Kenan_Dalba%C5%9Far. When I try to download the page www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far, the page is blank. How do I keep www.foobar.com/Mehmet Kenan Dalbaşar and send it back to the site when I need to?

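import os
import urllib
import urllib2

# headers, fullpath and links are defined elsewhere in the program
try:
    params = urllib.urlencode({'title': 'Foo', 'action': 'submit'})
    # urllib2 needs a scheme in the url, so 'http://' is added here
    req = urllib2.Request(url='http://foobar.com', data=params, headers=headers)
    f = urllib2.urlopen(req)

    # charset declared in the Content-Type header (utf-8 for this site)
    encoding = f.headers.getparam('charset')

    temp = f.read().decode(encoding)

    # lots of code to parse out the links

    for line in links:
        try:
            pagename = line
            pagename = pagename.replace('\n', '')
            print pagename

            # ':' and '/' cannot be used in the file name
            newpagename = pagename.replace(':', '_')
            newpagename = newpagename.replace('/', '_')
            final = os.path.join(fullpath, newpagename)
            print final
            final = final.encode('utf-8')
            print final

            # only download the page if it hasn't already been downloaded
            if not os.path.exists(final + ".xml"):
                print "doesn't exist"
                save = open(final + ".xml", 'w')
                # note: f was already read above, so this second f.read() returns ''
                save.write(f.read())
                save.close()
        except IOError as e:
            # assumed handler: the posted snippet stops before its except clauses
            print e
except urllib2.URLError as e:
    # assumed handler, see above
    print e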

Can you post the relevant code and what errors you get when trying to download the files? — Blender, Dec 19 '12 at 23:54
How does your constructed URL differ from that actual one? Can you post the `repr()` of both? — Blender, Dec 20 '12 at 00:38
The actual url contains the actual non-Latin character, rendered. The constructed url contains only the code for it (here, %C5%9). I want the actual character. The repr() shows the code. — bsg, Dec 20 '12 at 03:48
To go from '%C5%9F' (6 characters) to '\xC5\x9F' (2 characters), you need `urllib.unquote()`. — Armin Rigo, Dec 20 '12 at 06:38
Sorry - where would I put the unquote() and how would that help? — bsg, Dec 20 '12 at 15:12
Also, I saw another SO question suggesting the use of requests. Would that help? — bsg, Dec 20 '12 at 15:13
@bsg: I'm just answering your previous comment. If you have a url with the code `'%C5'` and want the actual character `\xC5`, then call `urllib.unquote()`. — Armin Rigo, Dec 22 '12 at 09:35
@ArminRigo, your comment solved my problem. I'd love to accept your answer as the correct one, but you didn't post an actual answer. If you post this comment about unquote() as an answer, I'll happily change the accepted answer to yours. Thanks a lot! — bsg, Dec 23 '12 at 21:50

score 1 · Answer 1 · answered Dec 21 '12 at 03:41

As you said, you can use requests instead of urllib.

Let's say you get the URL "www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far". Just pass it to requests.get() with the scheme (e.g. http://) in front, because requests rejects a URL that has no scheme:

import requests
r = requests.get("http://www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far")

Now you can get the content using r.text.
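For anyone trying this today, here is a minimal Python 3 sketch of the same idea. The helper names with_scheme and fetch_text are made up for illustration, and the 10-second timeout is an arbitrary choice. The percent-encoded path is passed along unchanged; only the missing scheme is added.

```python
def with_scheme(url, default_scheme="http"):
    """Return url unchanged if it already has a scheme, else prefix one.

    Hypothetical helper: requests.get() raises MissingSchema for a bare
    "www.foobar.com/..." string, so the scheme has to be added first.
    """
    if "://" in url:
        return url
    return default_scheme + "://" + url


def fetch_text(url):
    """Download a page and return its decoded body (needs network access)."""
    # requests is third-party; it is imported here so that with_scheme()
    # can be used even when requests is not installed.
    import requests

    response = requests.get(with_scheme(url), timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of returning the error page
    return response.text


print(with_scheme("www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far"))
```

Calling fetch_text("www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far") would then request the page with the percent-encoded name as it appeared in the link.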

score 0 · Accepted Answer · answered Jan 12 '13 at 10:05

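If you have a URL that contains, e.g., the code '%C5' and you want the actual character \xC5 instead, call urllib.unquote() on the URL. For a byte-string URL in Python 2, '%C5%9F' becomes the two bytes '\xC5\x9F', which decode as UTF-8 to ş.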

Armin Rigo
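As a rough illustration of what unquote() does with the name from the question, here is a small Python 3 sketch. This is an assumption about how you would write it today: the thread itself uses Python 2, where the function is urllib.unquote(); in Python 3 it lives in urllib.parse and decodes the bytes as UTF-8 by default.

```python
from urllib.parse import quote, unquote

encoded = "Mehmet_Kenan_Dalba%C5%9Far"

# %C5%9F is the UTF-8 byte pair for the single character "ş"
decoded = unquote(encoded)
print(decoded)

# quote() goes the other way, which is what the site expects in a URL
reencoded = quote(decoded)
print(reencoded == encoded)
```

So the readable name can be kept for display or for file names, and quote() turns it back into the form used in the link.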

Thanks a lot again - it really helped solve my problem. I apologize to anyone reading this question for how unclear it was - I'm not sure anyone will know why this was the answer, but it was what I was ultimately looking for and it worked. – bsg Jan 13 '13 at 01:08
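For later readers, the round trip described in the question (link in the page, file name on disk, URL rebuilt from the file name) could look roughly like this in Python 3. Everything here is a sketch under assumptions: BASE_URL, filename_for_link and url_for_filename are made-up names, and it assumes the page name contains no ':' or '/', since replacing those with '_' cannot be undone.

```python
import os
from urllib.parse import quote, unquote

BASE_URL = "http://www.foobar.com/"  # hypothetical site root


def filename_for_link(link_path, directory):
    """Map a percent-encoded link path to a readable .xml file path."""
    # unquote() restores the real characters, e.g. %C5%9F -> "ş"
    name = unquote(link_path).replace(":", "_").replace("/", "_")
    return os.path.join(directory, name + ".xml")


def url_for_filename(path):
    """Rebuild the page URL from a file saved by filename_for_link()."""
    name = os.path.splitext(os.path.basename(path))[0]
    # quote() percent-encodes the non-ASCII characters as UTF-8 again
    return BASE_URL + quote(name)


saved = filename_for_link("Mehmet_Kenan_Dalba%C5%9Far", "pages")
print(saved)
print(url_for_filename(saved))
```

The file on disk keeps the readable name, and the percent-encoded form is only produced at the moment the URL is built.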
