I have a Python string containing HTML that comes from a JSON response, and I want to parse it with the lxml library. The string contains escaped quotes (\") and escaped line breaks (\r\n), along with other special characters. How can I clean it up so that I can extract information from it with lxml? I want to run XPath selectors against it.
The string:
<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\r\n<html>\r\n\r\n<head>\r\n <META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\r\n</head>\r\n\r\n<body>\r\n\r\n<div>\r\n <table width=\"640\" align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\" style=\"font-family:Arial,Helvetica,sans-serif;font-size:14px\">\r\n <tr>\r\n <td align=\"center\">\r\n\r\n <table align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style=\"max-width:600px;text-align:left\">\r\n <tr>\r\n <td width=\"600\">\r\n <table align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"600\">\r\n <tr>\r\n <td height=\"10\"></td>\r\n </tr>\r\n <tr>\r\n <td align=\"center\">\r\n <a href=\"#0.1_\"><img src=\"https://ns.yatracdn.com/common/images/emailers/corp-flight-hotel/yatra-logo.png\" width=\"101\" height=\"45\" alt=\"Yatra.com\" title=\"Yatra.com\" border=\"0\" style=\"font-family:Arial,Helvetica,sans-serif;font-size:25px;color:#ea2330\" vspace=\"0\" hspace=\"0\" align=\"center\"></a>\r\n </td>\r\n </tr>\r\n <tr>\r\n <td height=\"10\"></td>\r\n </tr>\r\n <tr>\r\n <td>\r\n <table border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"600\" style=\"border:1px solid #d8d8d8\">\r\n <tr>\r\n <td height=\"10\"></td>\r\n </tr>\r\n <tr>\r\n <td width=\"10\"></td>\r\n <td colspan=\"3\"><b>Travel Request Details</b></td>\r\n </tr>\r\n <tr>\r\n <td height=\"10\"></td>\r\n </tr>\r\n <tr>\r\n <td width=\"10\"></td>\r\n <td>\r\n <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\" style=\"border:1px solid #d8d8d8\">\r\n <tbody>\r\n <tr>\r\n <td width=\"10\"></td>\r\n <td>\r\n <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\">\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Email Verification Date / Time </td>\r\n 
<td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">12 May 2020 17:14</td>\r\n </tr id='aaaaa'>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Request Submission Date / Time </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">12 May 2020 17:14</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Product </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Flight</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Journey Type </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">One way</td>\r\n </tr>\r\n\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Adult </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">1</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Child </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">0</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Infant </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">0</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Flight Class </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Travel Class</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Preferred Airline </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">\r\n </td>\r\n </tr>\r\n <tr>\r\n <td 
height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Non Stop Flight </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Preferred Airline</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Traveller Email </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">ankityadav56@demo.com</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Traveller Mobile</td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">9971255462</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Travel Policy Email</td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Corporate.traveler@yatra.com</td>\r\n </tr>\r\n\r\n <tr >\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Origin</td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">New Delhi(DEL)</td>\r\n </tr>\r\n\r\n <tr >\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Destination</td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Mumbai(BOM)</td>\r\n </tr>\r\n\r\n <tr >\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Depart Date</td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">26 Jun 2020</td>\r\n </tr>\r\n\r\n <tr >\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Preferred Time From</td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">00:23</td>\r\n </tr>\r\n\r\n </table>\r\n </td>\r\n <td width=\"10\"></td>\r\n </tr>\r\n\r\n 
</tbody>\r\n </table>\r\n\r\n </td>\r\n <td width=\"10\"></td>\r\n </tr>\r\n\r\n <tr>\r\n <td height=\"10\"></td>\r\n </tr>\r\n </table>\r\n\r\n </td>\r\n </tr>\r\n </table>\r\n </td>\r\n </tr>\r\n </table>\r\n </td>\r\n </tr>\r\n </table>\r\n\r\n</div>\r\n\r\n</body>\r\n\r\n</html>
With a clean string, the parser works like this:
>>> from io import StringIO
>>> from lxml import etree
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
>>> parser = etree.HTMLParser()
>>> tree = etree.parse(StringIO(broken_html), parser)
>>> # encoding="unicode" returns str rather than bytes, so print() shows the markup itself
>>> result = etree.tostring(tree.getroot(),
...                         pretty_print=True, method="html", encoding="unicode")
>>> print(result)
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <h1>page title</h1>
  </body>
</html>
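
For reference, here is a minimal sketch of what I am trying to achieve, assuming the backslashes are ordinary JSON escaping and the HTML sits under a key of the JSON object. The key name "body", the shortened markup, and the helper `field` are made up for illustration and are not part of the real payload. If the escapes only show up when looking at the raw JSON text, decoding it with `json.loads` should already produce a clean string that `lxml.html.fromstring` can parse.

```python
import json
from lxml import html

# Hypothetical JSON payload: the "body" key and the shortened markup are
# stand-ins for the real response, which is much longer.
raw_json = r'''{"body": "<html>\r\n<body>\r\n<table>\r\n<tr>\r\n<td align=\"left\">\r\n Product </td>\r\n<td align=\"right\">Flight</td>\r\n</tr>\r\n<tr>\r\n<td align=\"left\">Origin</td>\r\n<td align=\"right\">New Delhi(DEL)</td>\r\n</tr>\r\n</table>\r\n</body>\r\n</html>"}'''

# json.loads turns \" into " and \r\n into real line breaks.
markup = json.loads(raw_json)["body"]

# lxml's HTML parser tolerates the remaining whitespace and sloppy markup.
tree = html.fromstring(markup)


def field(tree, label):
    """Return the text of the cell that follows the cell holding `label`."""
    cells = tree.xpath(
        "//td[normalize-space(.) = $label]/following-sibling::td[1]",
        label=label,
    )
    return cells[0].text_content().strip() if cells else None


product = field(tree, "Product")
origin = field(tree, "Origin")
```

Matching on `normalize-space(.)` is meant to ignore the stray spaces and line breaks around labels such as " Product ", and passing the label as an XPath variable avoids building the expression with string formatting.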