Reading web pages that contain various languages such as Russian, Korean, etc.

Question

Hi everyone.

For my research projects, I have collected some web pages.

For example, http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3

As you can see on the above page, the committer's name is not in English.

Other pages also have committers' names written in various languages other than English.

The following code handles the committers' names.

import csv
import re
import urllib

def get_page (link):
    k = 1
    while k == 1:
        try:
            f = urllib.urlopen (link)
            htmlSource = f.read()
            return htmlSource
        except EnvironmentError:
            print ('Error occurred:', link)
        else:
            k = 2
    f.close()

def get_commit_info (commit_page):
    commit_page_string = str (commit_page)


    author_pattern = re.compile (r'<tr><th>author</th><td>(.*?)</td><td class=', re.DOTALL)
    t_author = author_pattern.findall (commit_page_string)

    t_author_string = str (t_author)
    author_point = re.search (" &lt;", t_author_string)
    author = t_author_string[:author_point.start()]

    print author

git_url = "http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3"
commit_page = get_page (git_url)
get_commit_info (commit_page)

The result of 'print author' is as follows:

\xd0\x9c\xd0\xb8\xd1\x80\xd0\xbe\xd1\x81\xd0\xbb\xd0\xb0\xd0\xb2 \xd0\x9d\xd0\xb8\xd0\xba\xd0\xbe\xd0\xbb\xd0\xb8\xd1\x9b

How can I print the name as it actually appears?

You're parsing HTML with regexes, which is almost always [a bad idea](http://stackoverflow.com/a/1732454/434217). Use a library like BeautifulSoup, which knows about Unicode. — Thomas K, May 18 '12 at 12:15
This isn't codereview.SE, but... the stuff going on with `k` is probably also a Bad Idea. If you actually can recover from that `EnvironmentError` by going into an infinite loop, use `while True:` and a `break` in the `else:` clause, rather than introducing a new variable. — Wooble, May 18 '12 at 13:25
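The loop structure suggested in the comment above could be sketched roughly as follows. This is a Python 3 sketch (the question uses Python 2), and the `opener` and `max_attempts` parameters as well as the stand-in `flaky_opener` are assumptions added for illustration; they are not part of the original code.

```python
import io
import urllib.request


def get_page(link, opener=urllib.request.urlopen, max_attempts=3):
    """Return the raw bytes of a page, retrying when the request fails."""
    attempts = 0
    while True:
        attempts += 1
        try:
            f = opener(link)
        except EnvironmentError:  # alias of OSError in Python 3
            print('Error occurred:', link)
            if attempts >= max_attempts:
                raise  # give up instead of looping forever
        else:
            break  # the request succeeded, leave the retry loop
    try:
        return f.read()
    finally:
        f.close()  # always reached, unlike the close() in the question


# Hypothetical stand-in opener: fails once, then succeeds,
# so the sketch runs without network access.
calls = []


def flaky_opener(link):
    calls.append(link)
    if len(calls) == 1:
        raise IOError('temporary failure')
    return io.BytesIO(b'<html>stub page</html>')


page = get_page('http://example.invalid/', opener=flaky_opener)
```

Capping the number of attempts is a design choice of this sketch: a page that never loads would otherwise keep the loop running forever.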

aychedee · Accepted Answer · 2012-05-19T06:07:36.860

Well, this will do what you want:

author = 'Мирослав Николић'
print author.decode('utf8') # Мирослав Николић

But it won't work if the encoding isn't UTF-8...

Most things use UTF-8. Mostly.
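To illustrate the caveat above, here is a minimal sketch (Python 3, whereas the question uses Python 2) of what can happen when bytes are decoded with the wrong codec. The sample bytes are the UTF-8 encoding of the first name from the question; picking `latin-1` and `cp1251` as the "other" encodings is only an assumption for the sake of the example.

```python
# UTF-8 bytes of the committer's first name from the question.
raw = b'\xd0\x9c\xd0\xb8\xd1\x80\xd0\xbe\xd1\x81\xd0\xbb\xd0\xb0\xd0\xb2'

# Right codec: the bytes turn into the expected text.
name = raw.decode('utf-8')
assert name == 'Мирослав'

# Wrong codec that accepts any byte: no error, but the result is
# unreadable, with one character per byte instead of one per letter.
mojibake = raw.decode('latin-1')
assert mojibake != name
assert len(mojibake) == len(raw)

# The same name stored in another encoding is not valid UTF-8 at all,
# so decoding it as UTF-8 raises an exception.
cp1251_bytes = name.encode('cp1251')
try:
    cp1251_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print('cp1251 bytes decodable as UTF-8:', decoded_ok)
```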

Unicode is complicated stuff to get your head around. 'author' is a string object that contains bytes. There is no information in those bytes to tell you what they represent. Absolutely none. You have to tell Python that this string of bytes is text encoded as UTF-8. Decoding then walks through the bytes and maps each sequence of one to four bytes to the Unicode code point (character) it encodes.
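The point about bytes versus text can be sketched as follows, in Python 3 (the question's code is Python 2). The `html` fragment and the e-mail address in it are made up to resemble the page in the question; they are not the real page. The sketch keeps the question's regex approach only to stay close to the original code. It also shows where the `\x` escapes in the question come from: calling `str()` on the list returned by `findall` gives the `repr` of each item rather than the item itself.

```python
import re

# Made-up fragment shaped like the commit page, as UTF-8 bytes.
html = ('<tr><th>author</th><td>Мирослав Николић &lt;miroslav@example.org&gt;'
        '</td><td class="right">2012-05-18</td></tr>').encode('utf-8')

# Bytes pattern: capture everything between <td> and " &lt;".
author_pattern = re.compile(
    rb'<tr><th>author</th><td>(.*?) &lt;.*?</td><td class=', re.DOTALL)

matches = author_pattern.findall(html)  # a list of bytes objects

# str() of a list shows the repr of its items, hence the \x escapes.
escaped = str(matches)

# Take the item itself and state which encoding its bytes are in.
author = matches[0].decode('utf-8')
assert author == 'Мирослав Николић'
```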

You could detect the encoding for each page by looking at the meta tags. In HTML5 they would look like this:

<meta charset="utf-8">
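Looking up the declared encoding could be sketched like this with the standard library's `html.parser` (Python 3). The names `CharsetFinder` and `detect_charset` are hypothetical, the fallback to UTF-8 is an assumption, and the handling of the older `http-equiv` form goes beyond what is described above.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get('charset'):
            # HTML5 form: <meta charset="utf-8">
            self.charset = attrs['charset'].strip().lower()
        elif (attrs.get('http-equiv') or '').lower() == 'content-type':
            # Older form: content="text/html; charset=utf-8"
            content = (attrs.get('content') or '').lower()
            _, _, value = content.partition('charset=')
            if value:
                self.charset = value.strip()


def detect_charset(raw, default='utf-8'):
    """Return the charset declared in raw HTML bytes, or the default."""
    finder = CharsetFinder()
    # The declaration itself is ASCII, so a codec that accepts every
    # byte is enough to read it before the real encoding is known.
    finder.feed(raw.decode('latin-1'))
    return finder.charset or default


page = '<meta charset="utf-8"><p>Мирослав</p>'.encode('utf-8')
text = page.decode(detect_charset(page))
```

A page can also declare nothing or declare the wrong encoding, so the result is best treated as a hint rather than a guarantee.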
