Match neighbour tag by Class values

Question

I have to create a dictionary of class values from the input HTML.

Input:

  <div>
     <p id="quarter-line-below1" class="firstpara-rw"><span class="dropcap-image-rw print-exclude-rw"><img alt="2014" src="243864_20.png"/></span><span class="dropcap-rw">2014 </span>has had some .............</p>
     <p id="firstpara1" class="firstpara-rw"><span class="dropcap-image-rw print-exclude-rw"><img alt="O" src="243864_69.png"/></span><span class="dropcap-rw">O</span>f course ...........</p>
     <p class="test1-image-rw print-exclude-rw" id="ornament1-orn"><img src="243865_18.png" /></p>
     <p class="test1-rw" id="ornament1">aA bB cC dD eE fF gG hH iI</p>
     <p class="test2-image-rw print-exclude-rw" id="ornament1-orn"><img src="243865_18.png" /></p>
     <p class="test22-rw" id="ornament1">aA bB cC dD eE fF gG hH iI</p>
 </div>

My algorithm is:

1. Parse the content with lxml.
2. Use XPath to get all tags whose class value contains -image-rw.
3. Iterate over every tag from step 2.
4. Get the target class value that contains -image-rw, and derive its counterpart by removing -image from it.
5. Get the tag that follows the target tag.
6. Check whether the derived class value is present on that next tag.
7. If it is present, add the pair to the dictionary.

Code:

from __future__ import print_function  # lets print() behave the same on Python 2 and 3

import time

import lxml.html as PARSER

# `content` holds the input HTML shown above as a string.
start_time = time.time()
root = PARSER.fromstring(content)
target_tags = root.xpath("//*[contains(@class, '-image-rw')]")
valid_class = {}
# Validation: pair each *-image-rw class with the *-rw class on the next tag.
for i in target_tags:
    target_class = [j for j in i.attrib["class"].split() if "-image-rw" in j][0]
    target_class_next = target_class.replace("-image-rw", "-rw")
    next_tag = i.getnext()
    if next_tag is None:
        # The target tag is the last child, so there is no neighbour to check.
        continue
    try:
        for j in next_tag.attrib["class"].split():
            if j == target_class_next:
                valid_class[target_class] = target_class_next
                break
    except KeyError:
        print("Class value missing. ", i)

print("Time:-", time.time() - start_time)
print("Result:-", valid_class)

Output:

Time:- 0.000622987747192
Result:- {'test1-image-rw': 'test1-rw', 'dropcap-image-rw': 'dropcap-rw'}

Is there a more Pythonic and optimized way to get the above result?
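
One possible direction, as a minimal sketch rather than a definitive implementation: walk the tree once and compare each tag with the sibling that directly follows it, instead of calling getnext() on the XPath results. The function name match_neighbour_classes and the trimmed sample are made up for illustration. To stay free of third-party packages, the sketch parses with the standard library's xml.etree.ElementTree rather than lxml, which assumes the input is well-formed. Since lxml elements expose the same iter()/get() interface, the function should also accept a tree from lxml.html.fromstring, but that has not been checked here. It also matches the suffix with endswith() rather than a substring test, which is slightly stricter than the loop above.

```python
import xml.etree.ElementTree as ET

SUFFIX = "-image-rw"


def match_neighbour_classes(root):
    """Map each '*-image-rw' class to its '*-rw' partner when the
    partner is a class of the tag's immediate next sibling."""
    matches = {}
    for parent in root.iter():
        children = list(parent)
        # Pair every child with the sibling that directly follows it.
        for tag, next_tag in zip(children, children[1:]):
            next_classes = set((next_tag.get("class") or "").split())
            for name in (tag.get("class") or "").split():
                if not name.endswith(SUFFIX):
                    continue
                wanted = name[:-len(SUFFIX)] + "-rw"
                if wanted in next_classes:
                    matches[name] = wanted
    return matches


# Trimmed, well-formed version of the input from the question.
sample = """
<div>
  <p class="firstpara-rw"><span class="dropcap-image-rw print-exclude-rw"><img alt="O" src="243864_69.png"/></span><span class="dropcap-rw">O</span>f course</p>
  <p class="test1-image-rw print-exclude-rw"><img src="243865_18.png"/></p>
  <p class="test1-rw">aA bB cC</p>
  <p class="test2-image-rw print-exclude-rw"><img src="243865_18.png"/></p>
  <p class="test22-rw">aA bB cC</p>
</div>
"""

valid_class = match_neighbour_classes(ET.fromstring(sample))
print(valid_class)
```

Tags without a class attribute, and tags with no following sibling, are skipped without any try/except, because get("class") returns None for a missing attribute and zip() stops at the last child.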

You may want to look at http://stackoverflow.com/questions/6325216/parse-html-table-to-python-list or http://stackoverflow.com/questions/11901846/beautifulsoup-a-dictionary-from-an-html-table ? — boardrider, May 26 '15 at 11:00
