
I am using lxml.html to parse HTML content, but I don't understand why lxml drops the attributes of the "body" tag. I tried both lxml.html.parse and lxml.html.document_fromstring, as suggested here.

Neither one keeps the attributes.

Example HTML string:

<html class="hello"> <head> <iframe src="index.html"></iframe> </head> <body class="foo"><h1>a</h1></body> </html>

Has anyone else faced this issue?

Karan

1 Answer


Possibly too late to help, but I've run into a similar issue with the same underlying parser (lxml is built on libxml2, which I use directly). I believe the problem is that an <iframe> cannot appear in the <head> of a document. When libxml2 sees one there, it tries to recover by implicitly closing the <head> and starting a <body>. That implicitly created <body> is what you then find in the tree, and it does not carry the class from your actual <body> tag. In fact, I think your actual <body> tag will not appear in the parsed model at all.
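If you want to check whether a given document is likely to trigger this recovery, here is a rough sketch that flags elements sitting inside <head> that are not allowed there. It uses only the standard library's html.parser rather than lxml, because that parser does no implicit closing and so reports the tags as written. The names HEAD_ALLOWED, HeadChecker and misplaced_head_tags are made up for this example, and the allowed set is my reading of the HTML standard's metadata content, so treat it as an assumption rather than as what libxml2 checks internally.

```python
from html.parser import HTMLParser

# Elements the HTML standard permits inside <head> (metadata content).
# Assumption: this list is for illustration, not libxml2's internal rule.
HEAD_ALLOWED = {"base", "link", "meta", "noscript", "script",
                "style", "template", "title"}


class HeadChecker(HTMLParser):
    """Collect start tags seen inside <head> that are not allowed there."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.misplaced = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        if tag == "head":
            self.in_head = True
        elif self.in_head and tag not in HEAD_ALLOWED:
            self.misplaced.append(tag)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def misplaced_head_tags(markup):
    """Return the names of disallowed tags found inside <head>."""
    checker = HeadChecker()
    checker.feed(markup)
    checker.close()
    return checker.misplaced


doc = ('<html class="hello"> <head> <iframe src="index.html"></iframe> </head> '
       '<body class="foo"><h1>a</h1></body> </html>')
print(misplaced_head_tags(doc))  # prints ['iframe']
```

If the list is non-empty, moving those elements into the <body> before handing the markup to lxml should let the real <body> tag, and its attributes, survive parsing.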

Jason Sankey