I'm trying to extract the email addresses (the href values of the links) from HTML files provided by a client of my company. They sent me six months' worth of data, but I can't extract the emails from two particular files: I keep getting the same UnicodeDecodeError every time, no matter what I try. According to my analysis, these files are encoded in UTF-8. My code is below:
from bs4 import BeautifulSoup as bsoup
url = r"C:\Users\Maximiliano\Documents\enero.html"
soup = bsoup((open(url).read()))
data = []
for p in soup.find_all("a"):
    datos = p.get("href")
    if datos[0] != "m":
        pass
    else:
        data.append(datos)
print(data)
I've already tried adding .decode("utf-8") after the read() call, but it makes no difference. Please help me!
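
For reference, here is a minimal sketch of one likely cause and a workaround, assuming Python 3 on Windows. There, open() without an encoding argument typically decodes with the system code page (often cp1252) rather than UTF-8, so the error can be raised inside read() before any later .decode() call is reached. The helper name read_html_text and the list of candidate encodings are assumptions made for illustration, not something taken from the files themselves.

```python
from pathlib import Path


def read_html_text(path, encodings=("utf-8-sig", "cp1252")):
    """Return the file's text, trying each candidate encoding in order.

    Hypothetical helper: the candidate encodings are a guess and may need
    adjusting for the actual files.
    """
    # Read raw bytes so that no implicit, locale-dependent decoding happens.
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate failed; try the next one
    # Last resort: latin-1 maps every byte to a character, so it never raises
    # (the resulting text may contain wrong characters, though).
    return raw.decode("latin-1")
```

The returned string could then be passed to bsoup(text, "html.parser") in place of open(url).read(). Note that in Python 3 a file opened in text mode already returns str from read(), and str has no decode method, which would explain why appending .decode("utf-8") does not help.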