0

I have a Python string containing HTML that comes from JSON, and I want to parse it with the lxml library. The string contains escape sequences (such as \" and \r\n) and other special characters. How can I clean it up so that I can extract information from it with lxml? I want to use XPath selectors on the string.

String:

<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\r\n<html>\r\n\r\n<head>\r\n    <META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\r\n</head>\r\n\r\n<body>\r\n\r\n<div>\r\n    <table width=\"640\" align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\" style=\"font-family:Arial,Helvetica,sans-serif;font-size:14px\">\r\n        <tr>\r\n            <td align=\"center\">\r\n\r\n                <table align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style=\"max-width:600px;text-align:left\">\r\n                    <tr>\r\n                        <td width=\"600\">\r\n                            <table align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"600\">\r\n                                <tr>\r\n                                    <td height=\"10\"></td>\r\n                                </tr>\r\n                                <tr>\r\n                                    <td align=\"center\">\r\n                                        <a href=\"#0.1_\"><img src=\"https://ns.yatracdn.com/common/images/emailers/corp-flight-hotel/yatra-logo.png\" width=\"101\" height=\"45\" alt=\"Yatra.com\" title=\"Yatra.com\" border=\"0\" style=\"font-family:Arial,Helvetica,sans-serif;font-size:25px;color:#ea2330\" vspace=\"0\" hspace=\"0\" align=\"center\"></a>\r\n                                    </td>\r\n                                </tr>\r\n                                <tr>\r\n                                    <td height=\"10\"></td>\r\n                                </tr>\r\n                                <tr>\r\n                                    <td>\r\n                                        <table border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"600\" style=\"border:1px solid #d8d8d8\">\r\n                                            <tr>\r\n                                               
 <td height=\"10\"></td>\r\n                                            </tr>\r\n                                            <tr>\r\n                                                <td width=\"10\"></td>\r\n                                                <td colspan=\"3\"><b>Travel Request Details</b></td>\r\n                                            </tr>\r\n                                            <tr>\r\n                                                <td height=\"10\"></td>\r\n                                            </tr>\r\n                                            <tr>\r\n                                                <td width=\"10\"></td>\r\n                                                <td>\r\n                                                    <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\" style=\"border:1px solid #d8d8d8\">\r\n                                                        <tbody>\r\n                                                        <tr>\r\n                                                            <td width=\"10\"></td>\r\n                                                            <td>\r\n                                                                <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\">\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Email Verification Date / Time </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">12 May 2020 17:14</td>\r\n                                                                    </tr id='aaaaa'>\r\n                         
                                           <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Request Submission Date / Time </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">12 May 2020 17:14</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Product </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Flight</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Journey Type </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">One way</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n  
                                                                          Adult </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">1</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Child </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">0</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Infant </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">0</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Flight Class </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Travel Class</td>\r\n                     
                                               </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Preferred Airline </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            </td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Non Stop Flight </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Preferred Airline</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Traveller Email </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">ankityadav56@demo.com</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n        
                                                                <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Traveller Mobile</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">9971255462</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Travel Policy Email</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Corporate.traveler@yatra.com</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Origin</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">New Delhi(DEL)</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Destination</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" 
style=\"border-bottom:1px solid #d8d8d8\">Mumbai(BOM)</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Depart Date</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">26 Jun 2020</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Preferred Time From</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">00:23</td>\r\n                                                                    </tr>\r\n\r\n                                                                </table>\r\n                                                            </td>\r\n                                                            <td width=\"10\"></td>\r\n                                                        </tr>\r\n\r\n                                                        </tbody>\r\n                                                    </table>\r\n\r\n                                                </td>\r\n                                                <td width=\"10\"></td>\r\n                                            </tr>\r\n\r\n                                            <tr>\r\n                                                <td height=\"10\"></td>\r\n                                            </tr>\r\n                             
           </table>\r\n\r\n                                    </td>\r\n                                </tr>\r\n                            </table>\r\n                        </td>\r\n                    </tr>\r\n                </table>\r\n            </td>\r\n        </tr>\r\n    </table>\r\n\r\n</div>\r\n\r\n</body>\r\n\r\n</html>

With a clean string, the parser works like this:

>>> from io import StringIO
>>> from lxml import etree
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"

>>> parser = etree.HTMLParser()
>>> tree   = etree.parse(StringIO(broken_html), parser)

>>> result = etree.tostring(tree.getroot(),
...                         pretty_print=True, method="html")
>>> print(result)
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <h1>page title</h1>
  </body>
</html>
h s

2 Answers

1

Maybe you want to use BeautifulSoup? It's a library that parses the markup into a tree you can iterate over. You can also search for specific tags, classes and so on. P.S. One of the parser options for it is lxml.

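from bs4 import BeautifulSoup
soup = BeautifulSoup(broken_html, 'lxml')  # the 'lxml' parser requires the lxml package
soup.title  # the first <title> tag, or None if there is none
soup.find_all('div')  # returns a list-like ResultSet with all div tags
my_tag = soup.find(id="yourID")  # "yourID" is a placeholder; find() returns None if nothing matches
my_tag.find_all('div')  # every div tag nested inside the tag with the id yourID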
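BeautifulSoup does not evaluate XPath expressions itself, so if XPath is a requirement, lxml can be used directly. The following is only a sketch: the `sample` markup is a shortened, hypothetical stand-in for the e-mail in the question, and the "rows with exactly two cells" rule is an assumption about which rows carry a label and a value.

```python
from lxml import html

# Hypothetical, shortened stand-in for the e-mail markup in the question.
sample = (
    "<html><body><table>"
    "<tr><td>Product </td><td>Flight</td></tr>"
    "<tr><td>Journey Type </td><td>One way</td></tr>"
    "<tr><td height='10'></td></tr>"
    "</table></body></html>"
)

tree = html.fromstring(sample)

details = {}
# Assumption: a data row is any <tr> with exactly two direct <td> children.
for row in tree.xpath("//tr[count(td) = 2]"):
    label, value = (cell.text_content().strip() for cell in row.xpath("./td"))
    details[label] = value

print(details)
```

A short relative expression such as `//tr[count(td) = 2]` tends to be less brittle than a long absolute path. Paths copied from browser developer tools often contain `tbody` steps that the browser inserted; as far as I know lxml does not add those elements, so such a path may match nothing.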
Moe
  • I added some examples in my post. – Moe May 12 '20 at 16:36
  • Can we use XPath with it? Or can you tell me how to select nested elements? – h s May 12 '20 at 16:39
  • This is the XPath that I wanted to select from my HTML: "//tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr". It will return a table from the HTML. From that table I want to iterate over all elements. – h s May 12 '20 at 16:45
  • OK, I found a workaround to extract the information that I wanted. Thanks for your help. – h s May 12 '20 at 17:35
0

Looks like you need to unescape your string first; have a look at ChristopheD's answer.

html_unescaped_string = html_escaped_string.decode('string_escape')  # Python 2 only: in Python 3, str has no .decode()

Then you can indeed use BeautifulSoup and cross your fingers that it finds its way through the other malformed bits of the string.
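Since the question says the string comes from JSON, the backslashes may simply be JSON escaping, in which case decoding the JSON with the standard `json` module should already produce a clean string. A minimal sketch, assuming that is the case; the `"body"` field name and the helper `unescape_json_fragment` are made up for illustration:

```python
import json

# Hypothetical raw JSON text; "body" is an assumed field name.
raw_json = r'{"body": "<td align=\"center\">\r\n  Flight</td>"}'

# Decoding the JSON document turns \" into " and \r\n into real line breaks.
payload = json.loads(raw_json)
html_text = payload["body"]


def unescape_json_fragment(fragment):
    """Decode a fragment that still carries JSON string escapes.

    Assumes the fragment is the inside of a valid JSON string
    (that is, without its surrounding double quotes).
    """
    return json.loads('"' + fragment + '"')


print(html_text)
print(unescape_json_fragment(r'<a href=\"#\">x</a>'))
```

If the value was obtained through `json.loads` (or a library that decodes JSON for you) and still shows `\"` when printed with `print`, it was probably escaped twice somewhere upstream; the helper above only covers that leftover layer.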

M. Hardy