I'm trying to scrape a page that contains Chinese characters as well as non-standard letters (such as accented pinyin). In the results, it looks like Mechanize simply skipped every Chinese character and non-standard letter.

My code:

import mechanize
import re

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0')]
br.set_handle_robots(False)

html = br.open('http://hanzidb.org/character-list/by-frequency')

html = html.read().lower()
html = unicode(html, errors='ignore')

# Only get the data between <td>...</td>
pattern2 = re.compile(r'<td>(.*?)</td>', re.MULTILINE)
match_description2 = re.findall(pattern2, html)

data = []

#Collect the content of the table
for desc in match_description2:
    data.append(desc)
    print desc

The result I should be getting (example):

<tr><td><a href="/character/是">是</a></td><td><span style="color:#000099;">shì</span></td><td><span class="smmr">indeed, yes, right; to be; demonstrative pronoun, this, that</span></td><td><a href="/character/日" title="Kangxi radical 72">日</a>&nbsp;72.5</td><td>9</td><td>1</td><td>1479</td></td><td>3</td></tr>

Versus the result I am getting:

<td><a href="/character/"></a></td><td><span style="color:#000099;">sh</span></td><td><span class="smmr">indeed, yes, right; to be; demonstrative pronoun, this, that</span></td><td><a href="/character/" title="kangxi radical 72"></a>&nbsp;72.5</td><td>9</td><td>1</td><td>1479</td></td><td>3</td>

I appreciate any help and I can provide any more info if need be.

han058
CJ Jacobs

1 Answer

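Remove the line html = unicode(html, errors='ignore'). Called without an encoding, unicode() decodes the bytes as ASCII, and errors='ignore' silently drops every byte that is not ASCII, which is why the Chinese characters and accented letters disappear.

Your terminal's LANG environment variable must also be set to a UTF-8 locale so the characters print correctly.

Then run your code again.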
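
To illustrate the difference, here is a minimal sketch. It does not fetch the real page: the raw value is a hypothetical stand-in for the bytes returned by html.read(), and it assumes the page is served as UTF-8. It compares an ASCII decode with errors ignored against an explicit UTF-8 decode, using the same regular expression as the question.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample standing in for html.read(): UTF-8 encoded bytes
raw = u'<td><a href="/character/\u662f">\u662f</a></td>'.encode('utf-8')

# ASCII decode with errors ignored: every non-ASCII byte is silently dropped
lossy = raw.decode('ascii', 'ignore')

# Decode with the page's encoding instead (assumed here to be UTF-8)
text = raw.decode('utf-8')

pattern = re.compile(r'<td>(.*?)</td>')
cells_lossy = pattern.findall(lossy)  # the character is missing
cells = pattern.findall(text)         # the character is preserved
```

If Unicode text is needed rather than raw bytes, decoding explicitly with the correct encoding is one alternative to simply deleting the line.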

han058
  • Sorry, I didn't clarify that this wasn't all of my code, only the relevant bits. Also, that change worked, thanks a ton! – CJ Jacobs Mar 25 '16 at 04:48