Using Mechanize to get Chinese characters from a website is returning nothing

Question

I'm trying to scrape Chinese characters as well as non-standard letters from a page. In the results, it's as if Mechanize simply skipped every Chinese character and non-standard letter.

My code:

import mechanize
import re

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0')]
br.set_handle_robots(False)

html = br.open('http://hanzidb.org/character-list/by-frequency')

html = html.read().lower()
html = unicode(html, errors='ignore')

# Only get the data between <td>...</td>
pattern2 = re.compile(r'<td>(.*?)</td>', re.MULTILINE)
match_description2 = re.findall(pattern2, html)

data = []

#Collect the content of the table
for desc in match_description2:
    data.append(desc)
    print desc

The result I should be getting (example):

<tr><td><a href="/character/是">是</a></td><td><span style="color:#000099;">shì</span></td><td><span class="smmr">indeed, yes, right; to be; demonstrative pronoun, this, that</span></td><td><a href="/character/日" title="Kangxi radical 72">日</a>&nbsp;72.5</td><td>9</td><td>1</td><td>1479</td></td><td>3</td></tr>

Versus the result I am getting:

<td><a href="/character/"></a></td><td><span style="color:#000099;">sh</span></td><td><span class="smmr">indeed, yes, right; to be; demonstrative pronoun, this, that</span></td><td><a href="/character/" title="kangxi radical 72"></a>&nbsp;72.5</td><td>9</td><td>1</td><td>1479</td></td><td>3</td>

I appreciate any help, and I can provide more info if needed.

Please use `beautifulsoup4` instead to parse HTML. Using regular expressions for HTML can lead to [undesirable results](http://stackoverflow.com/a/1732454/918959) — Antti Haapala -- Слава Україні, Mar 25 '16 at 05:24
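As a rough illustration of the suggestion in this comment, here is a minimal sketch of pulling the cell text out with `beautifulsoup4` instead of a regular expression. The `raw` bytes are a made-up stand-in for what `br.open(...).read()` returns, and treating the page as UTF-8 is an assumption, not something confirmed in the thread.

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the bytes returned by html.read().
raw = (u'<table><tr><td><a href="/character/是">是</a></td>'
       u'<td><span>shì</span></td><td>1479</td></tr></table>').encode('utf-8')

# Assumption: the page is UTF-8 encoded, so decode it explicitly first.
soup = BeautifulSoup(raw.decode('utf-8'), 'html.parser')

# Collect the visible text of every table cell.
cells = [td.get_text() for td in soup.find_all('td')]
```

With this approach the nested `<a>` and `<span>` tags are stripped by `get_text()`, so each entry in `cells` holds only the cell's text.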

score 1 · Accepted Answer · answered Mar 25 '16 at 04:46


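Remove the line `html = unicode(html, errors='ignore')`. Called without an encoding, `unicode()` decodes with the default codec (normally ASCII in Python 2), and `errors='ignore'` silently drops every byte it cannot decode. That is why the Chinese characters and the accented letters vanish from your output.

Your terminal's `LANG` environment variable must also be set to a UTF-8 locale so the characters print correctly.

Then run your code again.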
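To make the cause concrete, here is a small sketch that reproduces the character loss without any network access. The `raw` bytes are a hypothetical stand-in for the result of `html.read()`, and decoding as UTF-8 assumes that is the page's actual encoding.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample standing in for the bytes returned by html.read().
raw = u'<td><a href="/character/是">是</a></td><td>shì</td>'.encode('utf-8')

# Roughly what unicode(html, errors='ignore') does in Python 2:
# decode as ASCII and silently drop every non-ASCII byte.
lossy = raw.decode('ascii', 'ignore')

# Assumption: the page is UTF-8, so decoding with that codec keeps everything.
text = raw.decode('utf-8')

pattern = re.compile(r'<td>(.*?)</td>')
cells_lossy = pattern.findall(lossy)
cells = pattern.findall(text)
```

Here `cells_lossy` shows the same symptom as in the question (empty link text and `sh` instead of `shì`), while `cells` keeps the original characters.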


han058


Sorry, I didn't clarify that this wasn't all of my code, only the relevant bits. Also, that change worked, thanks a ton! – CJ Jacobs Mar 25 '16 at 04:48
