The following code works fine when run from Jupyter IPython notebook:

from bs4 import BeautifulSoup
xml_file_path = "<Path to XML file>"
s = BeautifulSoup(open(xml_file_path), "xml")

But when run from Eclipse/PyDev (which uses the same Python interpreter), it fails while creating the soup:

Traceback (most recent call last):
  File "~/parser/scratch.py", line 3, in <module>
    s = BeautifulSoup(open(xml_file), "xml")
  File "/anaconda/lib/python3.5/site-packages/bs4/__init__.py", line 175, in __init__
    markup = markup.read()
  File "/anaconda/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1812: ordinal not in range(128)
  • Python version: 3.5.2 (Anaconda 4.1.1)
  • BeautifulSoup: version 4
  • IPython Notebook version: 4.2.1
  • Eclipse version: Mars.2 Release (4.5.2)
  • PyDev version: 5.1.2.20160623256
  • Mac OS X: El Capitan 10.11.6

UPDATE: The character in the file that is causing the issue in Eclipse is , but it causes no issues in IPython Notebook! If I remove this character from the XML file, the code works fine in Eclipse as well. Is there some setting in Eclipse I need to change so that the code won't fail on this (and possibly other such) characters?

arun
  • Possible duplicate of [UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1](http://stackoverflow.com/questions/10561923/unicodedecodeerror-ascii-codec-cant-decode-byte-0xef-in-position-1) – DYZ Apr 11 '17 at 20:14
  • @DYZ - There is no printing here. It happens when I create the soup. – arun Apr 11 '17 at 20:17
  • Have you tried `open(xml_file_path, "utf-8")` ? – dot.Py Apr 11 '17 at 20:27
  • @dot.Py: That fails, but I tried `s = BeautifulSoup(open(xml_file_path), "xml", from_encoding="utf-8")` which also fails in Eclipse only – arun Apr 11 '17 at 20:38

1 Answer

I think you have to open the file with `open(xml_file_path, 'rb')` and specify the encoding explicitly for things to work the same in both environments. Otherwise there is an implicit conversion from bytes to unicode, and the encoding used for it apparently depends on your environment, since you get one default in Eclipse and another in IPython.
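To illustrate the environment dependence, here is a minimal sketch using only the standard library. It assumes the Eclipse launch ends up with an ASCII default while the notebook gets UTF-8, which the traceback suggests but does not prove. The sample bytes and the `try_decode` helper are hypothetical stand-ins, not taken from the actual XML file.

```python
import locale

# Encoding that open() falls back to in text mode when no encoding= is given.
# This value can differ between an Eclipse launch and a notebook kernel.
print("default text encoding:", locale.getpreferredencoding(False))


def try_decode(data, encoding):
    """Return the decoded text, or a 'failed: ...' message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: {}".format(exc)


# Hypothetical sample: U+FB01 is just an example of a character whose
# UTF-8 encoding starts with byte 0xef, like the one in the traceback.
sample = "<a>\ufb01</a>".encode("utf-8")

print(try_decode(sample, "ascii"))  # same kind of error as in the traceback
print(repr(try_decode(sample, "utf-8")))
```

Running this once from Eclipse and once from the notebook should show whether the two really report different default encodings.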

Try doing:

with open(xml_file_path, 'rb') as stream:
    contents = stream.read()  # raw bytes, no implicit decoding
    contents.decode('utf-8')  # raises UnicodeDecodeError if the bytes are not valid UTF-8

This is just to check whether the file can really be decoded as UTF-8 (i.e., whether that character is a valid UTF-8 sequence).
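If that check passes, one way to get the same behavior in both environments is to name the encoding whenever the file is read as text, so the environment default never comes into play. The sketch below assumes the file is UTF-8. `read_xml_text` is a hypothetical helper, and the temporary file only stands in for the real XML file.

```python
import os
import tempfile


def read_xml_text(path, encoding="utf-8"):
    """Read a file as text with an explicit encoding (hypothetical helper)."""
    with open(path, encoding=encoding) as stream:
        return stream.read()


# Throwaway file standing in for the real XML file; it contains a
# non-ASCII character whose UTF-8 encoding starts with byte 0xef.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write("<a>\ufb01</a>".encode("utf-8"))
    demo_path = tmp.name

text = read_xml_text(demo_path)
os.remove(demo_path)
print(repr(text))
```

The resulting string could then be passed to `BeautifulSoup(text, "xml")` instead of the open file object. Alternatively, the bytes read in `'rb'` mode could be handed over together with `from_encoding="utf-8"`.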

Fabio Zadrozny