UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 261060: character maps to <undefined>

Question

I'm trying to extract the href values (email addresses) from HTML files provided by a client of my company. They sent me six months' worth of data, but I'm unable to extract the emails from two particular files. I keep getting the same UnicodeDecodeError every time, no matter what I try. According to my analysis, these files are encoded in UTF-8. Here is the code:

from bs4 import BeautifulSoup as bsoup

url = r"C:\Users\Maximiliano\Documents\enero.html"
soup = bsoup((open(url).read()))

data = []
for p in soup.find_all("a"):
    datos = p.get("href")
    # Keep only hrefs starting with "m" (the mailto: links); skip anchors without an href.
    if datos and datos[0] == "m":
        data.append(datos)
print(data)

I've already tried adding `.decode("utf-8")` after the read, but it doesn't change anything. Please help me!

file: https://gofile.io/?c=SFM1T3

Could you please provide an input file where you get that error? — Riccardo Bucco, Oct 04 '19 at 14:16
When you call `open()` without `encoding=...`, some OS- and locale-dependent default is used, apparently some Windows 8-bit encoding in your case. Look at the header for an encoding declaration (it's probably UTF-8) and specify this in the `open()` call. — lenz, Oct 04 '19 at 14:21
Did you try `open(url, encoding="UTF-8")` as @lenz suggested? See https://stackoverflow.com/q/9233027/407651. — mzjn, Oct 04 '19 at 14:38

score 2 · Accepted Answer · answered Oct 04 '19 at 14:41

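As suggested in the comments, you simply have to pass the encoding parameter to `open()`. Without it, Python falls back to a locale-dependent default (a Windows 8-bit code page in your case), which cannot decode some of the UTF-8 bytes in the file: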

soup = bsoup(open(url, encoding="utf-8").read())
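
To see why this works, here is a minimal sketch that reproduces the error using only the standard library. The sample HTML, the file name `sample.html`, and the choice of `cp1252` as a stand-in for the Windows default encoding are assumptions for illustration; your actual default may be a different code page.

```python
import os
import tempfile

# Hypothetical sample: "\u00c1" (A with acute accent) is stored in UTF-8
# as the two bytes 0xC3 0x81.
html = '<a href="mailto:angel@example.com">\u00c1ngel</a>'

failed = False
error_message = ""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")

    # Write the file as UTF-8, like the files sent by the client.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

    # Assumption: cp1252 stands in for the locale default on Windows.
    # It has no character assigned to byte 0x81, so decoding fails.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
    except UnicodeDecodeError as exc:
        failed = True
        error_message = str(exc)

    # Naming the real encoding decodes the same bytes without error.
    with open(path, encoding="utf-8") as f:
        text = f.read()

print(error_message)
print(text == html)
```

The same bytes fail or succeed depending only on the `encoding` argument, which is why calling `.decode("utf-8")` after `read()` cannot help: in text mode the decoding (and the error) already happens inside `read()`.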

Riccardo Bucco
