
I have to create a dictionary of class values from the input HTML.

Input:

  <div>
     <p id="quarter-line-below1" class="firstpara-rw"><span class="dropcap-image-rw print-exclude-rw"><img alt="2014" src="243864_20.png"/></span><span class="dropcap-rw">2014 </span>has had some .............</p>
     <p id="firstpara1" class="firstpara-rw"><span class="dropcap-image-rw print-exclude-rw"><img alt="O" src="243864_69.png"/></span><span class="dropcap-rw">O</span>f course ...........</p>
     <p class="test1-image-rw print-exclude-rw" id="ornament1-orn"><img src="243865_18.png" /></p>
     <p class="test1-rw" id="ornament1">aA bB cC dD eE fF gG hH iI</p>
     <p class="test2-image-rw print-exclude-rw" id="ornament1-orn"><img src="243865_18.png" /></p>
     <p class="test22-rw" id="ornament1">aA bB cC dD eE fF gG hH iI</p>
 </div>

My algorithm is:

  1. Parse the content with lxml.
  2. Get all tags whose class value contains -image-rw, using XPath.
  3. Iterate over every tag from step 2.
  4. Get the target class value that contains -image-rw, and derive its counterpart value by removing -image from it.
  5. Get the next tag after the target tag.
  6. Check whether the counterpart value is present in the next tag's class.
  7. If it is present, add the pair to the dictionary.
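Steps 4 to 6 can be sketched on plain class strings, without the parser. This is only an illustration: `paired_class` and its parameter names are hypothetical, not part of lxml, and the class strings are taken from the input above.

```python
def paired_class(tag_classes, next_classes, marker="-image-rw", plain="-rw"):
    """Return (image_class, plain_class) when the next tag carries the
    counterpart of the first class containing the marker, else None."""
    # Step 4: first class token that contains the marker.
    image_class = next((c for c in tag_classes.split() if marker in c), None)
    if image_class is None:
        return None
    expected = image_class.replace(marker, plain)
    # Step 6: compare against whole class tokens, not substrings,
    # so "test2-rw" does not match "test22-rw".
    if expected in next_classes.split():
        return image_class, expected
    return None


print(paired_class("test1-image-rw print-exclude-rw", "test1-rw"))
# ('test1-image-rw', 'test1-rw')
print(paired_class("test2-image-rw print-exclude-rw", "test22-rw"))
# None
```

Comparing whole tokens is what keeps the test2/test22 pair out of the result.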

Code:

import lxml.html as PARSER
import time

# `content` holds the input HTML shown above as a string.
start_time = time.time()
root = PARSER.fromstring(content)
target_tags = root.xpath("//*[contains(@class, '-image-rw')]")
valid_class = {}
#- Validation.
for i in target_tags:
    # First class token on this tag that contains -image-rw.
    target_class = [j for j in i.attrib["class"].split() if "-image-rw" in j][0]
    target_class_next = target_class.replace("-image-rw", "-rw")
    next_tag = i.getnext()
    if next_tag is None:
        # Last child of its parent, so there is no next tag to check.
        continue
    next_class = next_tag.get("class")
    if next_class is None:
        print("Class value missing. ", i)
    elif target_class_next in next_class.split():
        valid_class[target_class] = target_class_next

print("Time:-", time.time() - start_time)
print("Result:-", valid_class)

Output:

Time:- 0.000622987747192
Result:- {'test1-image-rw': 'test1-rw', 'dropcap-image-rw': 'dropcap-rw'}

Is there a more Pythonic and optimized way to get the above result?
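For comparison, here is a minimal sketch of one possible restructuring, not a benchmarked improvement. It swaps lxml for the standard library's `xml.etree.ElementTree`, which only accepts well-formed XML, so it suits a sample like the one above but not arbitrary HTML. Since ElementTree has no `getnext()`, siblings are paired with `zip`. Unlike the code above, it checks every class containing -image-rw, not just the first. The name `image_class_pairs` and the shortened `sample` markup are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

MARKER = "-image-rw"


def image_class_pairs(markup):
    """Map each '*-image-rw' class to its '*-rw' counterpart when the
    counterpart appears on the immediately following sibling."""
    root = ET.fromstring(markup)
    pairs = {}
    for parent in root.iter():
        children = list(parent)
        # zip pairs every child element with its next sibling.
        for tag, nxt in zip(children, children[1:]):
            next_classes = set(nxt.get("class", "").split())
            for cls in tag.get("class", "").split():
                if MARKER not in cls:
                    continue
                expected = cls.replace(MARKER, "-rw")
                if expected in next_classes:
                    pairs[cls] = expected
    return pairs


# Shortened, well-formed version of the input (illustrative only).
sample = """<div>
<p class="firstpara-rw"><span class="dropcap-image-rw print-exclude-rw"><img alt="O" src="243864_69.png"/></span><span class="dropcap-rw">O</span>f course</p>
<p class="test1-image-rw print-exclude-rw"><img src="243865_18.png"/></p>
<p class="test1-rw">aA bB</p>
<p class="test2-image-rw print-exclude-rw"><img src="243865_18.png"/></p>
<p class="test22-rw">aA bB</p>
</div>"""

print(image_class_pairs(sample))
```

With lxml, the same loop body could keep `getnext()` and only replace the inner `for`/`break` with the set membership test.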

Vivek Sable
  • You may want to look at http://stackoverflow.com/questions/6325216/parse-html-table-to-python-list or http://stackoverflow.com/questions/11901846/beautifulsoup-a-dictionary-from-an-html-table ? – boardrider May 26 '15 at 11:00
