
I am saving an XML page locally from the Merriam-Webster API. Here is an example URL: http://www.dictionaryapi.com/api/v1/references/collegiate/xml/apple?key=bf534d02-bf4e-49bc-b43f-37f68a0bf4fd

I fetch the page from that URL with urlretrieve and save it as an XML file.

When I try to open the saved file, a UnicodeDecodeError occurs.

This is what I did:

page = open('test.xml')
bs = BeautifulSoup(page)

Then the following error happens:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb

I tried passing the filename as u'test.xml', but that didn't work.

sys.getdefaultencoding() returns 'utf-8'.

The default encoding is already utf-8, so changing it doesn't solve the problem. Thanks for the advice anyway.

1 Answer

You need to specify the encoding as utf-8, because that is how the data is encoded. The filename has nothing to do with what is inside the file, so prefixing it with u to make it a unicode string is not going to help:

import io
with io.open('test.xml', encoding="utf-8") as page:
    bs = BeautifulSoup(page)
Padraic Cunningham
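The following is a minimal sketch of why the explicit encoding matters, using only the standard library. The sample string and the helper names (SAMPLE, write_sample, read_as_text, read_as_bytes, demo) are made up for illustration, and the real content of test.xml is an assumption: 0xcb is the first byte of U+02C8 in UTF-8, which would be consistent with the error in the question. Note that in Python 3, open() without an encoding argument uses locale.getpreferredencoding(False) rather than sys.getdefaultencoding(), which may explain why the 'ascii' codec shows up even though the latter reports 'utf-8'.

```python
import io
import os
import tempfile

# Hypothetical sample: U+02C8 (a stress mark used in pronunciations)
# is encoded as the two bytes 0xCB 0x88 in UTF-8.
SAMPLE = u"<entry><pr>\u02c8a-p\u0259l</pr></entry>"


def write_sample(path):
    """Write SAMPLE to path as UTF-8 encoded text."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)


def read_as_text(path, encoding):
    """Read the file in text mode, decoding with the given encoding."""
    with io.open(path, encoding=encoding) as f:
        return f.read()


def read_as_bytes(path):
    """Read the file in binary mode; no decoding happens here."""
    with io.open(path, "rb") as f:
        return f.read()


def demo():
    """Return (ascii_failed, decoded_text, raw_bytes) for a temp file."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        write_sample(path)
        try:
            read_as_text(path, "ascii")
            ascii_failed = False
        except UnicodeDecodeError:
            # Same class of error as in the question.
            ascii_failed = True
        text = read_as_text(path, "utf-8")
        raw = read_as_bytes(path)
    finally:
        os.remove(path)
    return ascii_failed, text, raw
```

Either the decoded text or the raw bytes can then be handed to BeautifulSoup. When it receives bytes, BeautifulSoup tries to detect the encoding itself, so no decoding step is needed on the caller's side.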
  • Thanks! I tried using the pickle module, but when I do, it says that I ran out of input – Yinxuan Feng Aug 10 '16 at 13:05
  • @AlexZhang, how were you using pickle? – Padraic Cunningham Aug 10 '16 at 13:23
  • Wait a minute, I am trying something...I will explain soon – Yinxuan Feng Aug 10 '16 at 14:19
  • So this is what I did with pickle: htmlFile = urlopen(theUrl).read().decode('utf-8') with open(filePath, 'wb') as file: pickle.dump(htmlFile, file, pickle.HIGHEST_PROTOCOL) I think the problem may be that I wrote the mode argument as wb, so it is stored as bytes. Next comes: with open(filePath,'rb') as file: print(filePath,':',pickle.load(file)) return pickle.load(file) When I try to load it, I get that error. No matter how I try to open it and decode it as utf-8, it doesn't work, because the same error happens on read() – Yinxuan Feng Aug 10 '16 at 14:37
  • Why are you dumping the html using pickle? – Padraic Cunningham Aug 10 '16 at 15:21
  • Actually, I tried everything I know. I tried a plain open and write, I tried urlretrieve, and pickle was my only hope, but they all fail... Could it be something specific to Python 3? As mentioned, I use Python 3.4.4. I tried your method and the same error occurs, but thanks. If you know what is going on, please help – Yinxuan Feng Aug 11 '16 at 14:06
  • Add the full error traceback – Padraic Cunningham Aug 11 '16 at 15:02
  • Thank you so much, I just solved it! I simply need to open the file using f = open(filename, 'rb') and pass it to BeautifulSoup(f), then it works. – Yinxuan Feng Aug 12 '16 at 06:54
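On the pickle problem raised in the comments: one likely cause of the "ran out of input" message is that the snippet calls pickle.load(file) twice on the same file, once inside print() and once in the return statement. The first call consumes the only pickled object, so the second call hits the end of the stream and raises EOFError. The sketch below reproduces this under stated assumptions: it uses an in-memory io.BytesIO buffer in place of the file on disk, and the helper names (dump_text, load_twice, load_once) are hypothetical.

```python
import io
import pickle


def dump_text(text):
    """Pickle a string into bytes, as the 'wb' file in the comment would hold."""
    buf = io.BytesIO()
    pickle.dump(text, buf, pickle.HIGHEST_PROTOCOL)
    return buf.getvalue()


def load_twice(data):
    """Mimic the comment's snippet: two load() calls on one stream.

    Returns (first_object, hit_eof).
    """
    buf = io.BytesIO(data)
    first = pickle.load(buf)
    try:
        pickle.load(buf)  # nothing is left to read at this point
    except EOFError:
        return first, True
    return first, False


def load_once(data):
    """Load the object a single time and reuse the result."""
    buf = io.BytesIO(data)
    return pickle.load(buf)
```

Storing the result of a single pickle.load call in a variable, then printing and returning that variable, avoids the second read. Opening the pickle file with 'wb' and 'rb' is the expected usage, so the binary mode itself is unlikely to be the culprit.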