everyone.
For my research projects, I have collected some web pages.
For example, http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3
As you see the above web page, the committer's name is not English.
Other web pages, also, have committers' names written in various languages not English.
The following codes are for handling with committers' names.
import csv
import re
import urllib
def get_page (link):
k = 1
while k == 1:
try:
f = urllib.urlopen (link)
htmlSource = f.read()
return htmlSource
except EnvironmentError:
print ('Error occured:', link)
else:
k = 2
f.close()
def get_commit_info (commit_page):
commit_page_string = str (commit_page)
author_pattern = re.compile (r'<tr><th>author</th><td>(.*?)</td><td class=', re.DOTALL)
t_author = author_pattern.findall (commit_page_string)
t_author_string = str (t_author)
author_point = re.search (" <", t_author_string)
author = t_author_string[:author_point.start()]
print author
git_url = "http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3"
commit_page = get_page (git_url)
get_commit_info (commit_page)
The result of 'print author' is as follows:
\xd0\x9c\xd0\xb8\xd1\x80\xd0\xbe\xd1\x81\xd0\xbb\xd0\xb0\xd0\xb2 \xd0\x9d\xd0\ xb8\xd0\xba\xd0\xbe\xd0\xbb\xd0\xb8\xd1\x9b
How can I print the name exactly?