In my case I want to remove specific characters, such as „, from a string. I use BeautifulSoup to parse certain HTML paragraphs and extract a substring from each of them. So far my code looks like this:

# -*- coding: cp1252 -*-
from bs4 import BeautifulSoup as bs
import re

soup = bs(open("file.xhtml"), 'html.parser')

for tag in soup.find_all('p', {"class": "fnp2"}) :
    line = unicode(str(tag).split(':')[0], "utf-8")
    line = re.sub('(<p class="fnp2">)(\d+) ', '', line)
    line = line.replace('„', '')
    print line

But when I run it, I always get a UnicodeDecodeError:

line = line.replace('„', '')

UnicodeDecodeError: 'ascii' codec can't decode byte 0x84 in position 0: ordinal not in range(128)

What would be a solution for this?

narancs
  • Does your .xhtml specify an encoding? – lit Nov 28 '18 at 17:49
  • Try changing the problematic line to `line.replace(u'„', '')`. It may also be the case that you are getting the error in the `print` statement, however. Finally, make sure your script file is actually saved in cp1252 (or better, use UTF-8 for all of your code, always, and mark it in the header). – KT. Nov 28 '18 at 17:57
  • Have you tried `open("file.xhtml", encoding='utf-8')`? – TigerhawkT3 Nov 28 '18 at 18:06
  • @lit Yes it does: ` ` – narancs Nov 28 '18 at 18:25
  • @KT. Thank you. It seems like `line.replace(u'„', '')` was the answer. How can I make sure the script file is saved in cp1252, other than writing `# -*- coding: cp1252 -*-` at the start of the file? @lit and @TigerhawkT3 thank you for your suggestions. – narancs Nov 28 '18 at 18:36
  • The encoding that you save your file in is determined by your text editor. Most editors and IDEs provide a way to specify the encoding (and most save in UTF8 by default, by the way). The line `# -*- coding: ...` is there to tell the Python interpreter the encoding the file was saved in, it does *not* force your editor to save the file in this encoding (unless it is a very smart Python IDE). – KT. Nov 28 '18 at 18:39

1 Answer

The line variable in your code is a unicode object. When you call line.replace Python expects the first argument to also be a unicode object. If you provide a str object instead, Python will try to automatically decode it into a unicode string using the system default encoding (which you can check via sys.getdefaultencoding()).

Apparently, the system encoding is ascii in your case. The byte string '„' (the single byte 0x84 in a cp1252-encoded file) cannot be decoded using the ascii codec, because '„' is not an ASCII symbol, which causes the exception that you see.
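The failing step can be reproduced in isolation. The following is a minimal sketch, assuming the script file really is saved as cp1252, so that '„' is stored as the byte 0x84 named in the error message; the variable names are only illustrative. It runs on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
# Assumption: in a cp1252-encoded script the literal '„' is the single byte 0x84.
raw = b'\x84'

error = None
try:
    # This is what Python 2 attempts implicitly when a str is mixed with unicode.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    error = str(exc)

# Decoding with the codec the file was actually saved in works fine.
decoded = raw.decode('cp1252')

print(error)
print(decoded == u'\u201e')  # U+201E is the low double quotation mark
```

The explicit `decode('ascii')` call stands in for the implicit conversion that `line.replace` performs in Python 2 when it is handed a byte string.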

You could fix the problem by changing the default system encoding to the same one you used to provide the '„' string (CP1252, I guess). However, such a fix is only of academic interest, as it just sweeps the issue under the carpet.

A proper, safe and easy fix to your problem would be to simply provide a unicode object to the replace method in the first place. This would be as simple as replacing '„' with u'„' in your code.
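As a rough sketch of that fix, the helper below removes a set of characters from a unicode string using only unicode literals, so no implicit decoding ever takes place. The name `strip_chars`, the sample text, and the choice of the closing quote U+201C as the second character are assumptions made for illustration; they do not come from the question. The code runs on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-

def strip_chars(text, unwanted):
    """Return text with every character in unwanted removed (hypothetical helper)."""
    for ch in unwanted:
        # Both arguments are unicode, so nothing has to be decoded implicitly.
        text = text.replace(ch, u'')
    return text

# Assumption: the characters to drop are the low and high double quotes.
QUOTES = u'\u201e\u201c'

sample = u'\u201eBeispiel\u201c 12'
cleaned = strip_chars(sample, QUOTES)
print(cleaned == u'Beispiel 12')
```

Writing the characters as `\u` escapes also makes the script independent of the encoding the source file happens to be saved in.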

KT.