I'm doing a bit of data scraping on Wikipedia, and I want to read certain entries. I'm using urllib.urlopen('http://www.example.com') and calling read() on the result.

This works fine until it encounters non-English characters, as in Stanislav Šesták. Here are the first few lines:

import urllib

print urllib.urlopen("http://en.wikipedia.org/wiki/Stanislav_Šesták").read()

result:

<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" /><title>Stanislav ֵ estֳ¡k - Wikipedia, the free encyclopedia</title>
<meta name="generator" content="MediaWiki 1.23wmf8" />
<link rel="alternate" type="application/x-wiki" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&amp;action=edit" />
<link rel="edit" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&amp;action=edit" />
<link rel="apple-touch-icon" href="//bits.wikimedia.org/apple-touch/wikipedia.png" />

How can I retain the non-English characters? In the end this code will write the entry title and the URL in a .txt file.

DSM
HDunn
  • Lines of what? Post a fully-functional script that illustrates your problem. – Blender Jan 06 '14 at 16:47
  • Lines of the Wiki entry as read by Python. – HDunn Jan 06 '14 at 16:53
  • The character encoding is UTF-8, as indicated by the meta tag (and corroborated by experience). You are viewing or saving it in some other encoding. You are not showing us this step, and I don't think we can guess what it is. – tripleee Jan 06 '14 at 19:32
  • What are the actual bytes in the title? The first accented character should be 0xC5 0xA0 and the second ought to be 0xC3 0xA1, as you can see from the URL. – tripleee Jan 06 '14 at 19:34
  • Sorry, I don't quite understand what you're asking me. When I view the source code on the Wiki entry, it says UTF-8, all my script does is print that, what step is there supposed to be in the way? – HDunn Jan 06 '14 at 19:41
  • Then you are probably viewing it in something which isn't properly configured for UTF-8. Again, what are the actual bytes where you see the wrong glyphs? – tripleee Jan 06 '14 at 19:47
  • This is what I see: http://oi41.tinypic.com/fxs288.jpg The first is U+05B5, the second is U+05B3, both Hebrew diacritic marks. – HDunn Jan 06 '14 at 20:04

1 Answer

There are multiple issues:

  • Non-ASCII characters in a string literal: in this case you must put an encoding declaration at the top of the module.
  • You should percent-encode the URL path (u"Stanislav_Šesták" -> "Stanislav_%C5%A0est%C3%A1k").
  • You are printing bytes received from the web directly to your terminal. Unless both use the same character encoding, you may see garbage in place of some characters.
  • To interpret the HTML, you should probably use an HTML parser.
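
The second and third points can be checked without any network access. The following is a minimal Python 3 sketch (the code in this thread is Python 2). The choice of cp1255 as the terminal's code page is an assumption, picked only because it reproduces the Hebrew marks reported in the comments; the actual terminal encoding was never stated.

```python
from urllib.parse import quote

title = "Stanislav_Šesták"

# Percent-encode the UTF-8 bytes of the title for use in a URL path.
url_path = quote(title)
print(url_path)

# These are the bytes a UTF-8 page contains for the title.
raw = title.encode("utf-8")

# Decoding with the matching codec restores the original text.
right = raw.decode("utf-8")

# Decoding with a different codec gives mojibake. cp1255 is an assumption:
# it maps 0xC5 and 0xC3 to the Hebrew points U+05B5 and U+05B3.
wrong = raw.decode("cp1255")
print(ascii(wrong))
```

The bytes are identical in both cases; only the codec used to interpret them differs, which is why the page itself can be valid UTF-8 while the terminal shows the wrong glyphs.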

Here's a code example that takes into account the above remarks:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import cgi
import urllib
import urllib2

wiki_title = u"Stanislav_Šesták"
# percent-encode the UTF-8 bytes of the title for use in the URL path
url_path = urllib.quote(wiki_title.encode('utf-8'))
r = urllib2.urlopen("https://en.wikipedia.org/wiki/" + url_path)
# read the charset from the Content-Type header, if the server sent one
_, params = cgi.parse_header(r.headers.get('Content-Type', ''))
encoding = params.get('charset')
content = r.read()
# decode the received bytes to Unicode text, falling back to UTF-8
unicode_text = content.decode(encoding or 'utf-8')
print unicode_text # if this fails, set PYTHONIOENCODING
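
For readers on Python 3, a rough equivalent might look like the sketch below. It also covers the stated end goal of writing the entry title and URL to a .txt file. The helper names (`build_url`, `decode_body`, `fetch_text`, `save_entry`), the tab-separated line format, and the User-Agent string are all hypothetical choices for illustration, not part of the original answer.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE_URL = "https://en.wikipedia.org/wiki/"


def build_url(wiki_title):
    """Return the article URL with the title percent-encoded as UTF-8."""
    return BASE_URL + quote(wiki_title)


def decode_body(body, charset=None):
    """Decode response bytes, falling back to UTF-8 if no charset is given."""
    return body.decode(charset or "utf-8")


def fetch_text(wiki_title):
    """Download the page and return it as text (requires network access)."""
    # The User-Agent value is a placeholder chosen for this sketch.
    request = Request(build_url(wiki_title),
                      headers={"User-Agent": "example-script/0.1"})
    with urlopen(request) as response:
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


def save_entry(path, wiki_title):
    """Append 'title<TAB>url' to a text file, written explicitly as UTF-8."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(wiki_title + "\t" + build_url(wiki_title) + "\n")
```

Calling `save_entry("entries.txt", "Stanislav_Šesták")` appends one line to the file. Passing `encoding="utf-8"` to `open` keeps the output independent of the terminal or locale encoding, which is the source of the garbled characters in the question.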

jfs