I'm doing a bit of data scraping on Wikipedia, and I want to read certain entries.
I'm calling urllib.urlopen('http://www.example.com')
and then .read() on the object it returns.
This works fine until it encounters non-English characters, as in Stanislav Šesták. Here is the code and the first few lines of its output:
import urllib
print urllib.urlopen("http://en.wikipedia.org/wiki/Stanislav_Šesták").read()
result:
<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" /><title>Stanislav ֵ estֳ¡k - Wikipedia, the free encyclopedia</title>
<meta name="generator" content="MediaWiki 1.23wmf8" />
<link rel="alternate" type="application/x-wiki" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&action=edit" />
<link rel="edit" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&action=edit" />
<link rel="apple-touch-icon" href="//bits.wikimedia.org/apple-touch/wikipedia.png" />
How can I retain the non-English characters? In the end, this code will write the entry title and the URL to a .txt file.
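To make the goal concrete, here is a minimal sketch of what I'm trying to end up with. It assumes the bytes returned by read() are UTF-8 (the meta charset tag in the output suggests so), and that the garbled title comes from how the bytes are displayed rather than from the download itself. The names extract_title and save_entry and the sample bytes are made up for illustration, not part of my real script.

```python
# -*- coding: utf-8 -*-
import io
import re


def extract_title(raw_html):
    """Decode the raw bytes as UTF-8 (assumption) and return the <title> text."""
    text = raw_html.decode("utf-8")
    match = re.search(r"<title>(.*?)</title>", text, re.DOTALL)
    return match.group(1).strip() if match else None


def save_entry(path, title, url):
    """Append one 'title<TAB>url' line to a UTF-8 encoded text file."""
    with io.open(path, "a", encoding="utf-8") as f:
        f.write(u"%s\t%s\n" % (title, url))


# Stand-in for the bytes that urlopen(...).read() would return.
sample = (
    u'<head><meta charset="UTF-8" />'
    u"<title>Stanislav \u0160est\u00e1k - Wikipedia, the free encyclopedia</title>"
    u"</head>"
).encode("utf-8")

title = extract_title(sample)
```

With that in place, something like save_entry("entries.txt", title, url) would write the line with the characters intact, as long as the file is opened with an explicit UTF-8 encoding instead of the platform default.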