I'm trying to read and edit an HTML file. I'm using BeautifulSoup to edit the HTML in place, but I'm finding that even before the "soup" is made, my HTML file has already been interpreted by the read() function. For example, this:
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 12 (filtered)">
</head>
<a href="Aug_24_2018.txt"><b>Aug 24 2018: Report</a></br>
<a href="Aug_23_2018.txt"><b>Aug 23 2018: Report</a></br>
<a href="Aug_22_2018.txt"><b>Aug 22 2018: Report</a></br>
<a href="Aug_21_2018.txt"><b>Aug 21 2018: Report</a></br>
<a href="Aug_20_2018.txt"><b>Aug 20 2018: Report</a></br>
</html>
becomes this:
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 12 (filtered)">
</head>
<a href="Aug_24_2018.txt"><b>Aug 24 2018: Report</a>
<a href="Aug_23_2018.txt"><b>Aug 23 2018: Report</a>
<a href="Aug_22_2018.txt"><b>Aug 22 2018: Report</a>
<a href="Aug_21_2018.txt"><b>Aug 21 2018: Report</a>
<a href="Aug_20_2018.txt"><b>Aug 20 2018: Report</a>
</html>
which is very different: it ruins the formatting and runs all the links together.
This is the code I'm using to read:
import bs4

with open("/data/report.html") as inf:
    txt = inf.read()  # this is where the problem occurs
soup = bs4.BeautifulSoup(txt, 'lxml')
I'm not at liberty to change the formatting of the original file, so I want to conform to it as much as possible. Is there any way to keep the </br> tag?
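One way to narrow this down is to look at the string returned by read() before it ever reaches BeautifulSoup. Below is a minimal sketch using only the standard library. The sample line and the temporary file are assumptions for illustration, standing in for /data/report.html:

```python
import os
import tempfile

# Hypothetical sample standing in for one line of /data/report.html.
sample = '<a href="Aug_24_2018.txt"><b>Aug 24 2018: Report</a></br>\n'

# Write the sample to a temporary file so the check is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    with open(path) as inf:
        txt = inf.read()  # read() returns the file's text; it does not parse HTML
finally:
    os.remove(path)  # clean up the temporary file

# Check whether the </br> tag is still present right after read().
tags_survive_read = "</br>" in txt
print(tags_survive_read, txt == sample)
```

If both values come out true for the real file as well, the text is intact after read(), and the </br> tags are more likely being dropped later, when the 'lxml' parser builds the tree and the soup is printed back out.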
<br/>, self closing, not a closing tag. – Adam Aug 24 '18 at 15:51