2

I am trying to parse values from HTML using Python with lxml and XPath.

Here is my HTML data:

<table>
<tr>
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
      <td class="u">
       <select name="record[13][type]">
         <option SELECTED value="A" >A</option>
         <option value="AAAA" >AAAA</option>
         <option value="CNAME" >CNAME</option>
         <option value="HINFO" >HINFO</option>
         <option value="MX" >MX</option>
         <option value="NAPTR" >NAPTR</option>
         <option value="NS" >NS</option>
         <option value="PTR" >PTR</option>
         <option value="SOA" >SOA</option>
         <option value="SPF" >SPF</option>
         <option value="SRV" >SRV</option>
         <option value="SSHFP" >SSHFP</option>
         <option value="TXT" >TXT</option>
         <option value="RP" >RP</option>
         <option value="URL" >URL</option>
         <option value="MBOXFW" >MBOXFW</option>
         <option value="CURL" >CURL</option>
       </select>
      </td>
      <td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td>

<td class="u"><input class="wide" name="record[14][name]" value="exampledomain2.com"></td>
      <td class="u">
       <select name="record[14][type]">
         <option SELECTED value="CNAME" >A</option>
         <option value="AAAA" >AAAA</option>
         <option value="CNAME" >CNAME</option>
         <option value="HINFO" >HINFO</option>
         <option value="MX" >MX</option>
         <option value="NAPTR" >NAPTR</option>
         <option value="NS" >NS</option>
         <option value="PTR" >PTR</option>
         <option value="SOA" >SOA</option>
         <option value="SPF" >SPF</option>
         <option value="SRV" >SRV</option>
         <option value="SSHFP" >SSHFP</option>
         <option value="TXT" >TXT</option>
         <option value="RP" >RP</option>
         <option value="URL" >URL</option>
         <option value="MBOXFW" >MBOXFW</option>
         <option value="CURL" >CURL</option>
       </select>
      </td>
      <td class="u"><input class="wide" name="record[14][content]" value='exampledomain1.com'></td>

<td class="u"><input class="wide" name="record[15][name]" value="exampledomain3.com"></td>
      <td class="u">
       <select name="record[15][type]">
         <option SELECTED value="A" >A</option>
         <option value="AAAA" >AAAA</option>
         <option value="CNAME" >CNAME</option>
         <option value="HINFO" >HINFO</option>
         <option value="MX" >MX</option>
         <option value="NAPTR" >NAPTR</option>
         <option value="NS" >NS</option>
         <option value="PTR" >PTR</option>
         <option value="SOA" >SOA</option>
         <option value="SPF" >SPF</option>
         <option value="SRV" >SRV</option>
         <option value="SSHFP" >SSHFP</option>
         <option value="TXT" >TXT</option>
         <option value="RP" >RP</option>
         <option value="URL" >URL</option>
         <option value="MBOXFW" >MBOXFW</option>
         <option value="CURL" >CURL</option>
       </select>
      </td>
      <td class="u"><input class="wide" name="record[15][content]" value='10.10.10.3'></td>
</tr>
</table>

What I want is to parse the values and print them as below:

exampledomain1.com A 10.10.10.1
exampledomain2.com CNAME exampledomain1.com
exampledomain3.com A 10.10.10.3

Here is what I tried:

#!/usr/bin/python
import lxml.html
from lxml import etree

doc = lxml.html.document_fromstring("""Here whole html data""")
txt1 = doc.xpath('//*[@class="wide"]/@value')
txt2 = doc.xpath('//@SELECTED/text()')
print txt1
print txt2

But it's not working as I wanted. Any help would be appreciated.

Thank you all.

Mike Pennington
Manish

2 Answers

3

I fixed the code to return the following, which is very close to what you asked for:

(py26_default)[mpenning@Bucksnort ~]$ python parse.py
exampledomain1.com 10.10.10.1
exampledomain2.com exampledomain1.com
exampledomain3.com 10.10.10.3
(py26_default)[mpenning@Bucksnort ~]$

With the XML parser used below, I could not retrieve the selected record[13][type] option with xpath, because SELECTED has no value and that is not well-formed XML. There are other ways to iterate through this, but I will leave that as an exercise for the OP. Note that I did fix the HTML in the OP's question to include <table> and <tr> tags:

import lxml.html
from lxml import etree
from lxml.etree import XMLParser

parser = XMLParser(ns_clean=True, recover=True)
doc = etree.fromstring("""Here whole html data""", parser)
elem1 = doc.xpath('//input[@name="record[13][name]"]')
# NOTE: <option SELECTED> cannot be retrieved with xpath... SELECTED must have
#   a value to do so...
#elem2 = doc.xpath('//select[@name="record[13][type]"]/option[@SELECTED]')
elem3 = doc.xpath('//input[@name="record[13][content]"]')

for idx, val in enumerate(elem1):
    print val.attrib['value'], elem3[idx].attrib['value']
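
A note on the commented-out `elem2` line, as a minimal sketch rather than a definitive fix: it assumes the document is parsed with `lxml.html` instead of `XMLParser`. The HTML parser accepts valueless attributes and lowercases attribute names, so the selected option should be reachable as `option[@selected]`. The fragment below is a shortened stand-in for the HTML in the question, and the code is written for Python 3.

```python
import lxml.html

# Shortened, one-record stand-in for the HTML in the question (assumption).
html = """
<table><tr><td class="u">
  <select name="record[13][type]">
    <option SELECTED value="A" >A</option>
    <option value="AAAA" >AAAA</option>
    <option value="CNAME" >CNAME</option>
  </select>
</td></tr></table>
"""

doc = lxml.html.fromstring(html)

# The HTML parser lowercases attribute names, so SELECTED is matched as @selected.
selected = doc.xpath('//select[@name="record[13][type]"]/option[@selected]/@value')
print(selected)
```

Reading `@value` rather than the option text matters for the second record in the question, where the selected option has `value="CNAME"` but the text `A`.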

<!-- The (fixed) html source I used -->
<table>
<tr>
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
      <td class="u">
       <select name="record[13][type]">
         <option SELECTED value="A" >A</option>
         <option value="AAAA" >AAAA</option>
         <option value="CNAME" >CNAME</option>
         <option value="HINFO" >HINFO</option>
         <option value="MX" >MX</option>
         <option value="NAPTR" >NAPTR</option>
         <option value="NS" >NS</option>
         <option value="PTR" >PTR</option>
         <option value="SOA" >SOA</option>
         <option value="SPF" >SPF</option>
         <option value="SRV" >SRV</option>
         <option value="SSHFP" >SSHFP</option>
         <option value="TXT" >TXT</option>
         <option value="RP" >RP</option>
         <option value="URL" >URL</option>
         <option value="MBOXFW" >MBOXFW</option>
         <option value="CURL" >CURL</option>
       </select>
      </td>
      <td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td>

<td class="u"><input class="wide" name="record[13][name]" value="exampledomain2.com"></td>
      <td class="u">
       <select name="record[13][type]">
         <option SELECTED value="CNAME" >A</option>
         <option value="AAAA" >AAAA</option>
         <option value="CNAME" >CNAME</option>
         <option value="HINFO" >HINFO</option>
         <option value="MX" >MX</option>
         <option value="NAPTR" >NAPTR</option>
         <option value="NS" >NS</option>
         <option value="PTR" >PTR</option>
         <option value="SOA" >SOA</option>
         <option value="SPF" >SPF</option>
         <option value="SRV" >SRV</option>
         <option value="SSHFP" >SSHFP</option>
         <option value="TXT" >TXT</option>
         <option value="RP" >RP</option>
         <option value="URL" >URL</option>
         <option value="MBOXFW" >MBOXFW</option>
         <option value="CURL" >CURL</option>
       </select>
      </td>
      <td class="u"><input class="wide" name="record[13][content]" value='exampledomain1.com'></td>

<td class="u"><input class="wide" name="record[13][name]" value="exampledomain3.com"></td>
      <td class="u">
       <select name="record[13][type]">
         <option SELECTED value="A" >A</option>
         <option value="AAAA" >AAAA</option>
         <option value="CNAME" >CNAME</option>
         <option value="HINFO" >HINFO</option>
         <option value="MX" >MX</option>
         <option value="NAPTR" >NAPTR</option>
         <option value="NS" >NS</option>
         <option value="PTR" >PTR</option>
         <option value="SOA" >SOA</option>
         <option value="SPF" >SPF</option>
         <option value="SRV" >SRV</option>
         <option value="SSHFP" >SSHFP</option>
         <option value="TXT" >TXT</option>
         <option value="RP" >RP</option>
         <option value="URL" >URL</option>
         <option value="MBOXFW" >MBOXFW</option>
         <option value="CURL" >CURL</option>
       </select>
      </td>
      <td class="u"><input class="wide" name="record[13][content]" value='10.10.10.3'></td>
</tr>
</table>
Mike Pennington
  • Hi Mike, the `name="record[13]..."` field changes for each of the other DNS records, which I have corrected in the HTML in my question. So in this case `//input[@name="record[13][name]"]` will not catch the records with different numbers. How can I define a wildcard or a range in it? – Manish Aug 01 '12 at 15:01
  • You could use [an `lxml` regex](http://stackoverflow.com/a/2756994/667301) to solve this problem – Mike Pennington Aug 01 '12 at 15:26
  • Thank you Mike. I got that working with regex, but I am still stuck on getting the SELECTED value. – Manish Aug 02 '12 at 16:13
  • I am talking about this code ` – Manish Aug 02 '12 at 16:41
  • I have helped enough. This answer demonstrates how to solve your problem; however, I cannot solve all the problems and this is part of your job. You need to rise to the challenge. – Mike Pennington Aug 02 '12 at 16:48
  • Hi Mike, thank you for all your help. I just started playing around with Python 15 days ago, have learned a few things, and am enjoying it. I got somewhat closer to what I want. I defined this search string to get the selected value: `elem4 = doc.xpath(r'''//select[re:match(@name, "record\[[0-9]{1,3}\]\[type\]")]/option/following-sibling::text()''', namespaces={'re': 'http://exslt.org/regular-expressions'})` `print val.attrib['value'], elem4[idx], elem3[idx].attrib['value']` and I get output such as `exampledomain1.com value="A" >A 10.10.10.1`. I am still poking around to get only A. – Manish Aug 03 '12 at 14:59
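
The two open points in these comments (matching every record number, and getting only the selected type) can be sketched together. This is a hedged illustration, not the approach from the linked answer: it assumes parsing with `lxml.html`, swaps the EXSLT `re:match` for Python's own `re` module, and uses a hypothetical helper name `extract_records`. The HTML is a shortened stand-in for the question's data, and the code targets Python 3.

```python
import re
import lxml.html

# Shortened stand-in for the question's HTML (assumption: same structure).
html = """
<table><tr>
<td><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
<td><select name="record[13][type]">
  <option SELECTED value="A" >A</option>
  <option value="CNAME" >CNAME</option>
</select></td>
<td><input class="wide" name="record[13][content]" value='10.10.10.1'></td>
<td><input class="wide" name="record[14][name]" value="exampledomain2.com"></td>
<td><select name="record[14][type]">
  <option SELECTED value="CNAME" >A</option>
  <option value="AAAA" >AAAA</option>
</select></td>
<td><input class="wide" name="record[14][content]" value='exampledomain1.com'></td>
<td><input class="wide" name="record[15][name]" value="exampledomain3.com"></td>
<td><select name="record[15][type]">
  <option SELECTED value="A" >A</option>
  <option value="MX" >MX</option>
</select></td>
<td><input class="wide" name="record[15][content]" value='10.10.10.3'></td>
</tr></table>
"""

NAME_FIELD = re.compile(r'record\[(\d+)\]\[name\]$')


def extract_records(html_text):
    """Return (name, type, content) tuples, one per record number found."""
    doc = lxml.html.fromstring(html_text)
    rows = []
    for name_input in doc.xpath('//input[@name]'):
        match = NAME_FIELD.match(name_input.get('name'))
        if match is None:
            continue  # not a record[N][name] field
        idx = match.group(1)
        # The HTML parser lowercases SELECTED, so it is matched as @selected.
        rtype = doc.xpath(
            '//select[@name="record[%s][type]"]/option[@selected]/@value' % idx)
        content = doc.xpath(
            '//input[@name="record[%s][content]"]/@value' % idx)
        rows.append((name_input.get('value'),
                     rtype[0] if rtype else '',
                     content[0] if content else ''))
    return rows


records = extract_records(html)
for name, rtype, content in records:
    print(name, rtype, content)
```

Looking up the type and content by the record number taken from the name field avoids relying on the three result lists staying in the same order.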
0
# Assumes `tree` is the document parsed with lxml.html, e.g.
#   tree = lxml.html.fromstring(html)
# The name and content fields are <input> elements, so read @value (they have no text).
# The HTML parser lowercases SELECTED, so the chosen option is matched as @selected.
record_13_name = tree.xpath("//input[@name='record[13][name]']/@value")
record_13_type = tree.xpath("//select[@name='record[13][type]']/option[@selected]/@value")
record_13_content = tree.xpath("//input[@name='record[13][content]']/@value")


record_14_name = tree.xpath("//input[@name='record[14][name]']/@value")
record_14_type = tree.xpath("//select[@name='record[14][type]']/option[@selected]/@value")
record_14_content = tree.xpath("//input[@name='record[14][content]']/@value")


record_15_name = tree.xpath("//input[@name='record[15][name]']/@value")
record_15_type = tree.xpath("//select[@name='record[15][type]']/option[@selected]/@value")
record_15_content = tree.xpath("//input[@name='record[15][content]']/@value")
Saurabh Chandra Patel