
I'm trying to read and edit an HTML file. I'm using BeautifulSoup to edit the HTML in place, but I'm finding that even before the "soup" is made, my HTML file has already been interpreted by the read() function. For example:

<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 12 (filtered)">
</head>

<a href="Aug_24_2018.txt"><b>Aug 24 2018: Report</a></br>
<a href="Aug_23_2018.txt"><b>Aug 23 2018: Report</a></br>
<a href="Aug_22_2018.txt"><b>Aug 22 2018: Report</a></br>
<a href="Aug_21_2018.txt"><b>Aug 21 2018: Report</a></br>
<a href="Aug_20_2018.txt"><b>Aug 20 2018: Report</a></br>

</html>

becomes this:

<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 12 (filtered)">
</head>

<a href="Aug_24_2018.txt"><b>Aug 24 2018: Report</a>
<a href="Aug_23_2018.txt"><b>Aug 23 2018: Report</a>
<a href="Aug_22_2018.txt"><b>Aug 22 2018: Report</a>
<a href="Aug_21_2018.txt"><b>Aug 21 2018: Report</a>
<a href="Aug_20_2018.txt"><b>Aug 20 2018: Report</a>

</html>

which is very different, as it ruins the formatting and smushes all the links together.

This is the code I'm using to read:

import bs4

with open("/data/report.html") as inf:
    txt = inf.read()  # this is where the problem occurs
    soup = bs4.BeautifulSoup(txt, 'lxml')

I'm not at liberty to change the formatting of the original file, so I want to conform to it as much as possible. Any possible solutions to keep the </br> tag?
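
(A minimal check, assuming the same /data/report.html path, to narrow down whether read() or the parser is the step that drops the tag:)

import bs4

with open("/data/report.html") as inf:
    txt = inf.read()

# If read() were the culprit, the tag would already be gone at this point.
print("</br>" in txt)

soup = bs4.BeautifulSoup(txt, 'lxml')

# Compare against the parsed tree.
print("</br>" in str(soup))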

LMP
  • Use 'html.parser' or 'html5lib' instead of 'lxml' – SanthoshSolomon Aug 24 '18 at 15:51
  • It's also `<br/>`, self closing, not a closing tag. – Adam Aug 24 '18 at 15:51
  • Check this out for the tag: https://stackoverflow.com/questions/1946426/html-5-is-it-br-br-or-br though no doubt it's not your issue (or I wouldn't expect it to be) – Adam Aug 24 '18 at 15:52
  • @SmashGuy the problem isn't the BeautifulSoup parser, it's the read() function – LMP Aug 24 '18 at 15:52
  • I regenerated the test case and the `</br>` disappears only after parsing with the `lxml` parser. – SanthoshSolomon Aug 24 '18 at 15:59
  • Your markup is a mess. There's no such thing as `</br>`. Also, you do not close `<b>`. – Marcin Orlowski Aug 24 '18 at 15:59
  • Yes, the b-tag before it is syntactically wrong. – Martin Aug 24 '18 at 16:11
  • The `</br>` probably _is_ your problem. It makes your HTML invalid, and impossible to parse without various hacky guesses by the parser. The `lxml` parser is apparently just throwing them away, which is reasonable. A different parser might raise an exception, or turn it into a guess at what you really meant, or try to handle it exactly the same way Netscape 4.0.1 would have, or whatever. You can’t complain that any of those options is “wrong”. Your HTML isn’t valid HTML, so if you want it parsed as HTML, it’s either going to change, or fail to parse. – abarnert Aug 24 '18 at 16:16
  • @abarnert I agree that the HTML is wrong, but why would the Python read() function get rid of it? Is it the encoding? – LMP Aug 24 '18 at 16:22
  • @LMP I'd be willing to bet that the `read` _doesn't_ get rid of it, and whatever you're doing to debug this, you just checked it in the wrong place. [See your code running on repl.it](https://repl.it/repls/GivingDigitalAnalyst): the `</br>` tags are still there after `read`, but they're not there after parsing. – abarnert Aug 24 '18 at 17:38
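
Following the parser suggestions in the comments above, a quick way to compare what each tree builder does with one of the offending lines (a sketch; it assumes the lxml and html5lib packages are installed alongside bs4):

import bs4

snippet = '<a href="Aug_24_2018.txt"><b>Aug 24 2018: Report</a></br>'

for parser in ("lxml", "html.parser", "html5lib"):
    soup = bs4.BeautifulSoup(snippet, parser)
    print(parser, str(soup))  # shows what, if anything, survives of the </br>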

1 Answer


It looks like someone has failed at closing the b-tag and added a “/br”-tag by mistake. As this is invalid HTML, I’d caution against keeping it. Instead, consider replacing it with a `</b>`, which hopefully was someone’s intent in the first place. For this I’d use a text editor, like Notepad or Vim.

Opening and reading the file doesn’t change the HTML; it’s the parsing step that drops the tag.
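
If editing the file by hand isn't an option, a one-off programmatic cleanup along the same lines might look like this (a sketch, not part of the original answer; it assumes every </br> in the file was meant to be the missing </b>):

import bs4

with open("/data/report.html") as inf:
    txt = inf.read()

# Assumption: each </br> was a typo for the missing </b>.
txt = txt.replace("</br>", "</b>")

soup = bs4.BeautifulSoup(txt, 'lxml')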

Martin
  • BeautifulSoup was _originally_ a more lenient parser than lxml, but it hasn’t been that in many years. As of version 4, it’s not even a parser at all, it’s a wrapper around your choice of parsers (with lxml as the default) that gives them all a more convenient, and consistent, interface. Recommending that someone look at lxml when they’re already using lxml and it’s exactly what’s causing their problem is silly. And “if it isn’t valid HTML, it’s probably not worth it” isn’t much of an answer for “how do I parse this invalid HTML?” – abarnert Aug 24 '18 at 16:29
  • Your regex doesn’t match anything in the OP’s code, and in fact does the exact opposite of what they asked for: instead of preserving invalid `</br>` tags, you’re removing valid `<br>` tags. – abarnert Aug 24 '18 at 16:31
  • @abarnert, I guess this is probably why I switched to xml.etree some time ago. However, I notice that lxml.etree exists. Is this lxml with an etree-style API? – Martin Aug 24 '18 at 22:46
  • Yes, `lxml.etree` is a superset of the ElementTree API. And it's been the main API for using `lxml` since the start, and the only one even mentioned in the tutorials. – abarnert Aug 24 '18 at 22:54
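
On that last point, a tiny illustration of driving `lxml.etree` through the familiar ElementTree-style calls (a sketch; the sample markup here is made up):

from lxml import etree

root = etree.fromstring("<html><body><a href='Aug_24_2018.txt'><b>Report</b></a></body></html>")

# Element.iter(), .get() and .findtext() work the same way as in xml.etree.ElementTree.
for link in root.iter("a"):
    print(link.get("href"), link.findtext("b"))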