I am new to Python 2.7 and I am trying to extract some information from HTML files. More specifically, I want to read text that contains content in multiple languages. Here is my script, which I hope makes things clearer.

import urllib2
import BeautifulSoup

url = 'http://www.bbc.co.uk/zhongwen/simp/'

page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup.BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})

print data[0]['content'].encode("utf-8")

The result I get is:

BBCϊ╕φόΨΘύ╜ΣΎ╝Νϊ╕╗ώκ╡Ύ╝Νbbcchinese.com, email news, newsletter, subscription, full text

The problem is in the first string. Is there any way to print exactly what I am reading? Also, is there any way to detect the exact encoding of the text in each language?

PS: I would like to mention that the site was selected at random, as it is representative of the problem I am encountering.

Thank you in advance!

Darkmoor

1 Answer

The problem is with the terminal you are printing to, not with the script. The script works fine, and if you write the data to a file you will get it correctly.
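
To illustrate why the terminal is the culprit, here is a minimal sketch of UTF-8 bytes being displayed in a legacy console code page. The choice of code page 737 (Greek OEM) is an assumption inferred from the Greek letters in the garbled output, not something stated in the question, and `show_as_console` is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def show_as_console(text, console_codec="cp737"):
    """Return what a console using console_codec would display
    when it receives the UTF-8 bytes of text.

    console_codec="cp737" is an assumption, not a known fact
    about the asker's machine.
    """
    utf8_bytes = text.encode("utf-8")
    # The console interprets each byte as one character of its own code page.
    return utf8_bytes.decode(console_codec)


original = "BBC中文网"
garbled = show_as_console(original)

# Reversing the wrong decoding recovers the text, so nothing was lost:
# only the display was wrong.
restored = garbled.encode("cp737").decode("utf-8")
```

Under that assumption, `garbled` reproduces the start of the output shown in the question, and `restored` equals `original` again.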

Example:

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.bbc.co.uk/zhongwen/simp/'

page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})

with open("test.txt", "w") as myfile:
    myfile.write(data[0]['content'].encode("utf-8"))

test.txt:

BBC中文网,主页,bbcchinese.com, email news, newsletter, subscription, full text  
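
As a variation on the file-writing approach above, the following sketch uses `io.open` with an explicit encoding, so the file receives UTF-8 regardless of the platform default. The helper names, the temporary path, and the literal string (a stand-in for `data[0]['content']`) are assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io
import os
import tempfile


def write_text_utf8(path, text):
    """Write unicode text to path, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text_utf8(path):
    """Read a UTF-8 encoded file back as unicode text."""
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read()


# Stand-in for data[0]['content']; the real value comes from the parsed page.
content = "BBC中文网，主页，bbcchinese.com"

# A temporary directory keeps the sketch from touching the working directory.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
write_text_utf8(path, content)
round_tripped = read_text_utf8(path)
```

With `io.open` the text is passed as unicode and encoded on write, so no manual `.encode("utf-8")` call is needed. Open the resulting file in an editor that reads UTF-8 to see the Chinese characters.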

Which OS and terminal are you using?

4d4c
  • I am using Win 7 x64 and I am running the .py files from cmd. – Darkmoor Sep 16 '13 at 08:15
  • CMD gives me the same result. Here is a similar [question](http://stackoverflow.com/questions/2706097/how-to-do-proper-unicode-and-ansi-output-redirection-on-cmd-exe), and here is [another one](http://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how). – 4d4c Sep 16 '13 at 11:06
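
Since cmd cannot display these characters in its default code page, one fallback is to replace whatever the console cannot represent instead of emitting misleading bytes. This is only a sketch: `console_safe` is a hypothetical helper, and the `cp737` value passed below is an assumed example code page, not one confirmed in this thread.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import sys


def console_safe(text, encoding=None):
    """Return text with characters the console cannot encode replaced by '?'.

    If no encoding is given, fall back to sys.stdout.encoding, which may be
    None when output is piped, and then to ASCII.
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, "replace").decode(encoding)


# Assumed example: a console using the Greek OEM code page.
preview = console_safe("BBC中文网", encoding="cp737")
```

This does not make the Chinese text visible. It only shows honestly which characters the console cannot render, while the file output remains the reliable way to inspect the data.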