
I want to parse an HTML snippet like the one below with the xml.etree.ElementTree module of Python 3.

<html>
  <table>
    ...
    <td><img src="myimg.png" title="mytitle" alt="myalttext"></td>
    ...
  </table>
</html>

But parsing fails with a "mismatched tag" error because the img tag is closed with ">" rather than "/>". No error occurs when I close the tag with "/>".
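
A minimal reproduction of the behaviour described above, assuming the markup is passed as a string to ET.fromstring. The two strings are simplified stand-ins for the snippet (the elided rows are left out), and try_parse is a hypothetical helper used only for this illustration:

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the snippet above (the elided rows are left out).
unclosed = ('<html><table><tr><td>'
            '<img src="myimg.png" title="mytitle" alt="myalttext">'
            '</td></tr></table></html>')
self_closed = unclosed.replace('alt="myalttext">', 'alt="myalttext"/>')


def try_parse(markup):
    """Return the root tag on success, or the parser's error message."""
    try:
        return ET.fromstring(markup).tag
    except ET.ParseError as exc:
        return str(exc)


print(try_parse(unclosed))     # a "mismatched tag" error with line and column
print(try_parse(self_closed))  # html
```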

However, I'm loading the HTML from a website, so I can't expect the img tag to be closed with "/>". Closing it with a plain ">" also seems to be common practice (see W3Schools on images).

Do you have an idea how I can avoid this? I would like to avoid preprocessing the HTML manually before passing it to the xml.etree.ElementTree parser. I would also like to avoid other parsers, unless they are already available in the Python standard library.

tangoal
  • The HTML snippet is not well-formed XML (because of the image element not being closed properly). Therefore you cannot parse it with ElementTree, which is for XML only. – mzjn Feb 24 '18 at 10:54
  • There must be a way to just ignore that img tag. I don't even need the information in it. Of course I could remove the tag via regex before passing the HTML to the parser, or close the tag to get well-formed XML. But is there no other way? – tangoal Feb 24 '18 at 11:45
  • ElementTree works with XML, not "almost XML", and it is not possible to just ignore the `img` tag. But there are many other options. I would recommend that you try BeautifulSoup (even though it's not in the standard library): https://www.crummy.com/software/BeautifulSoup/. See also https://stackoverflow.com/q/11709079/407651 and https://stackoverflow.com/q/2505041/407651. – mzjn Feb 24 '18 at 11:59
  • Not what I wanted originally, but in the end I preprocessed the HTML myself and did some replacements before passing it to the xml.etree.ElementTree parser. Not the most generic or future-proof solution, but it is quick and works fine. There was also other stuff to do, like defining unknown entities in the CDATA section and introducing a root element. @mzjn: Thanks anyway, I had a look at BeautifulSoup and I will give it a try in one of my upcoming projects. – tangoal Feb 25 '18 at 08:31
  • It would have been better to go with BeautifulSoup from the beginning. That would have saved me a lot of time, since I also ran into further trouble. So I just switched to BeautifulSoup and it works out of the box. I had to learn it the hard way. – tangoal Feb 25 '18 at 21:59
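
For readers who need to stay within the standard library, one possible sketch (not something proposed in the thread) is to let html.parser.HTMLParser read the lenient HTML and forward its events to an xml.etree.ElementTree.TreeBuilder, closing void elements such as img right after they open. The class name HTMLTreeBuilder and the VOID_ELEMENTS set are hypothetical names chosen for this illustration, and the sketch assumes that all other tags in the input are properly nested and closed:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML void elements never have a closing tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class HTMLTreeBuilder(HTMLParser):
    """Hypothetical helper: forwards HTMLParser events to an ET.TreeBuilder."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        attrib = {k: (v if v is not None else "") for k, v in attrs}
        self._builder.start(tag, attrib)
        if tag in VOID_ELEMENTS:
            self._builder.end(tag)  # close void elements immediately

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS:  # void elements were already closed
            self._builder.end(tag)

    def handle_data(self, data):
        self._builder.data(data)

    def close(self):
        super().close()
        return self._builder.close()  # the root Element


html = """<html>
  <table>
    <tr><td><img src="myimg.png" title="mytitle" alt="myalttext"></td></tr>
  </table>
</html>"""

parser = HTMLTreeBuilder()
parser.feed(html)
root = parser.close()
img = root.find(".//img")
print(img.get("title"))  # mytitle
```

Real-world pages often contain unclosed or misnested non-void tags as well, which this sketch does not repair. That is the kind of further trouble for which the comments above recommend BeautifulSoup.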
