
I'm trying to extract the email addresses (the href values) from HTML files provided by a client of my company. They sent me six months' worth of data, but I can't extract the emails from two particular files: I get the same UnicodeDecodeError every time, no matter what I try. As far as I can tell, these files are encoded in UTF-8. My code is below:

from bs4 import BeautifulSoup as bsoup

url = r"C:\Users\Maximiliano\Documents\enero.html"
soup = bsoup((open(url).read()))

data = []
for p in soup.find_all("a"):
    datos = p.get("href")
    # Keep only hrefs that start with "m" (the mailto: links)
    if datos[0] == "m":
        data.append(datos)
print(data)

I've already tried adding .decode("utf-8") after the read, but it doesn't change anything. Please help me!

file: https://gofile.io/?c=SFM1T3

  • Could you please provide an input file where you get that error? – Riccardo Bucco Oct 04 '19 at 14:16
  • When you call `open()` without `encoding=...`, an OS- and locale-dependent default is used, apparently some Windows 8-bit encoding in your case. Look at the header for an encoding declaration (it's probably UTF-8) and specify this in the `open()` call. – lenz Oct 04 '19 at 14:21
  • I just added the HTML file! – MaximilianoCifuentes Oct 04 '19 at 14:30
  • Did you try `open(url, encoding="UTF-8")` as @lenz suggested? See https://stackoverflow.com/q/9233027/407651. – mzjn Oct 04 '19 at 14:38
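
The default that the comments describe can be inspected directly. The following is a minimal sketch, not taken from the thread: the sample address and the `try_decode` helper are made up for illustration, and cp1252 is only assumed to be the Windows default in question.

```python
import locale

# Encoding that open() falls back to when encoding= is omitted.
# On many Windows setups this is cp1252 rather than UTF-8.
print(locale.getpreferredencoding(False))

# Hypothetical sample, not taken from the client's files.
sample = "mailto:Ángel@example.com"
raw = sample.encode("utf-8")  # "Á" becomes the two bytes 0xC3 0x81


def try_decode(data, encoding):
    """Return the decoded text, or a short message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc.reason}"


utf8_result = try_decode(raw, "utf-8")
cp1252_result = try_decode(raw, "cp1252")

print(utf8_result == sample)  # True: the bytes round-trip as UTF-8
print(cp1252_result)          # decoding fails: 0x81 is unassigned in cp1252
```

Not every non-ASCII character triggers the error. For example, "ñ" (bytes 0xC3 0xB1) decodes under cp1252 without complaint, just to the wrong characters. That might be why only some of the files fail, but it is a guess, since the actual files were not inspected here.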

1 Answer


As suggested in the comments, you simply have to pass the encoding parameter to open():

soup = bsoup((open(url, encoding="utf-8").read()))
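
For reference, here is a slightly fuller sketch of the same fix. It is an illustration, not the code from this answer: it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser` so that it runs without extra packages, and `MailtoCollector`, `extract_mailto_links`, and the sample HTML are made-up names and data. It opens the file with an explicit encoding inside a `with` block and skips anchors that have no `href`, which would make `datos[0]` fail in the question's loop.

```python
import os
import tempfile
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect href values of <a> tags that start with "mailto:"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # href is None when the attribute is missing or has no value
        if href and href.startswith("mailto:"):
            self.links.append(href)


def extract_mailto_links(path, encoding="utf-8"):
    """Read an HTML file with an explicit encoding and return its mailto links."""
    with open(path, encoding=encoding) as handle:
        text = handle.read()
    collector = MailtoCollector()
    collector.feed(text)
    collector.close()
    return collector.links


# Hypothetical sample file standing in for enero.html
html = (
    "<html><body>"
    '<a href="mailto:Ángel@example.com">Ángel</a>'
    '<a href="https://example.com">site</a>'
    '<a name="top">no href</a>'
    "</body></html>"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", encoding="utf-8", delete=False
) as tmp:
    tmp.write(html)

links = extract_mailto_links(tmp.name)
os.remove(tmp.name)
print(len(links))  # 1
```

With BeautifulSoup, the text read inside the `with` block can be passed to `bsoup(...)` exactly as in the one-liner above, and the same `href and href.startswith("mailto:")` check applies to `p.get("href")`.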
Riccardo Bucco