0

Assume the HTML source is as follows:

<html><body>
<div class="aname">
    <div class="bname">
        <h5><a href="url_a0" class="cname">aTitle</a></h5>
    </div>
    <div class="">
        <div><img src="url_a1"/>img_text<br /></div>
        <div><strong>label_a1:</strong>text_a1<br /></div>
        <div><strong>label_a2:</strong>text_a2<br /></div>
        <div><strong>label_spe:<a href="url_a2">*</a>:</strong>
            <span class="box-span" >spantext_a1</span>
            <span class="box-span" >spantext_a2</span>
            <span class="box-span" >spantext_a3</span><br />
        </div>              
    </div>
</div>

<div class="aname">
    <div class="bname">
        <h5><a href="url_a0" class="cname">aTitle</a></h5>
    </div>
    <div class="">
        <div><img src="url_a1"/>img_text<br /></div>
        <div><strong>label_a1:</strong>text_b1<br /></div>
        <div><strong>label_a2:</strong>text_b2<br /></div>
        <div><strong>label_a3:</strong>text_b3<br /></div>
        <div><strong>label_spe:<a href="url_b3">*</a>:</strong>
            <span class="box-span" >spantext_b1</span>
            <span class="box-span" >spantext_b2</span>
            <span class="box-span" >spantext_b3</span>
            <span class="box-span" >spantext_b4</span>
            <span class="box-span" >spantext_b5</span>
            <span class="box-span" >spantext_b6</span><br />
        </div>              
    </div>
</div>
</body></html>

I want the output to be:

aTitle
url_a0
label_a1:
text_a1
label_a2:
text_a2
label_spe:
spantext_a1
spantext_a2
spantext_a3

aTitle
url_a0
label_a1:
text_b1
label_a2:
text_b2
label_a3:
text_b3
label_spe:
spantext_b1
spantext_b2
spantext_b3
spantext_b4
spantext_b5
spantext_b6

I want to do this with lxml in Python. What should I do? The HTML contains more than one such div, and the number of spans varies. I have tried many times but still can't get the right output. My code is below:

# -*- coding: utf-8 -*-
import codecs
import re

import lxml.html


def main():
    # Read the file, skipping bytes that cannot be decoded as UTF-8.
    with codecs.open('test.html', 'r', errors='ignore', encoding='utf-8') as ff:
        html0 = ff.read()

    # Strip the <strong> tags and the box-span class before parsing.
    html1 = re.sub('<strong>', '', html0)
    html2 = re.sub('</strong>', '', html1)
    html = re.sub('class="box-span"', '', html2)

    spelabels = ['img_text', 'label_a1', 'label_a2', 'label_a3']

    root = lxml.html.fromstring(html)
    contents = root.xpath('.//div[@class="aname"]/div[@class=""]/div/text()')
    for content in contents:
        if content[0:8] in spelabels:
            print(content[0:8])
            print(content[9:])
        elif content == "label_spe:":
            print(content)
            nestedcontents = root.xpath('.//div[@class="aname"]/div[@class=""]/div[text()="label_spe:"]/following-sibling::span/text()')
            print(nestedcontents)
            for nestedcontent in nestedcontents:
                print(nestedcontent)


if __name__ == '__main__':
    main()

the output:

img_text

label_a1
text_a1
label_a2
text_a2
label_spe:
[]
img_text

label_a1
text_b1
label_a2
text_b2
label_a3
text_b3
label_spe:
[]

It seems to work partly, but I don't know how to extract url_a1, and the text in the spans does not appear.

Cœur

2 Answers

0

Here's my attempt. It gives your desired output for your sample input. I made it tolerant of certain tag changes like div vs. span.

import xml.etree.ElementTree as etree # or: from lxml import etree

body = etree.parse('test.html').find('body')

for aname in body.iterfind('*[@class="aname"]'):
    cname = aname.find('*[@class="bname"]//a[@class="cname"]')
    print(cname.text) # title
    print(cname.get('href')) # url

    for div in aname.iterfind('div[@class=""]/div'):
        strong = div.find('strong')
        if strong is not None:
            print(strong.text) # label
            text = div[0].tail.strip() # http://stackoverflow.com/a/9674097/4323
            if text:
                print(text)
            else:
                for box in div.iterfind('*[@class="box-span"]'):
                    print(box.text)

    print()
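
The `div[0].tail` lookup is the least obvious step, so here is a minimal sketch of how ElementTree splits mixed content into `.text` and `.tail`. The one-row input is copied from the question's HTML; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# A single row shaped like the ones in the question's HTML.
div = ET.fromstring('<div><strong>label_a1:</strong>text_a1<br /></div>')

strong = div[0]
label = strong.text  # text inside <strong>...</strong>
value = strong.tail  # text after </strong>, up to the next tag

print(label)     # label_a1:
print(value)     # text_a1
print(div.text)  # None: no text sits between <div> and its first child
```

Text that follows a closing tag belongs to that element's `.tail`, not to the parent's `.text`, which is why the value is read from the first child rather than from the div itself.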
John Zwinck
  • Amazing! It works!! Thanks a lot!! I am a newbie in the coding world. – Norman Weng Jan 09 '15 at 05:16
  • I applied your approach to my program and modified it as follows: ff = codecs.open('test.html',mode='r',errors='ignore',encoding='utf-8') html = ff.read() body = etree.parse(html).find('body'), but it raises an error: Traceback (most recent call last): File "", line 250, in run_nodebug File "E:\Python\test1.py", line 24, in body = etree.parse(html).find('body') File "", line 62, in parse File "", line 26, in parse IOError: [Errno 2] No such file or directory: (followed by the whole test1.html source). Could you tell me what's wrong? – Norman Weng Jan 09 '15 at 14:06
  • @NormanWeng: that's because `parse()` takes a filename, not the file contents like you are now passing it. You want `fromstring()` instead of `parse()`. – John Zwinck Jan 09 '15 at 14:20
  • Thank you very much for your reply. I applied your code and only changed the filename from test.html to test1.html. I got this error message: lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 31, column 71. I changed the filename to another HTML file and got the same error. What's the problem? What should I do? – Norman Weng Jan 09 '15 at 15:08
  • I found that the HTML source file contains &nbsp;, which causes the error, so I used re.sub('&nbsp;','',html) and changed text = div[0].tail.strip() to text = div[0].tail. It finally works! – Norman Weng Jan 09 '15 at 16:13
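
The two problems raised in the comments above (passing markup to `parse()`, and `&nbsp;` breaking the XML parser) can be handled together. The following is a minimal sketch, not the answerer's code: the `extract` helper and the shortened `sample` string are hypothetical, and it assumes the markup is otherwise well-formed XML. For messier real-world pages, an HTML parser such as `lxml.html.fromstring` is likely the more robust choice.

```python
import xml.etree.ElementTree as ET


def extract(html_text):
    """Return one list of strings (title, url, labels, values) per "aname" block."""
    # &nbsp; is an HTML entity that a plain XML parser does not know;
    # the numeric reference &#160; is the same character and is valid XML.
    xml_text = html_text.replace('&nbsp;', '&#160;')
    root = ET.fromstring(xml_text)  # fromstring() takes markup, parse() takes a file name

    blocks = []
    for aname in root.iterfind('body/*[@class="aname"]'):
        cname = aname.find('*[@class="bname"]//a[@class="cname"]')
        lines = [cname.text, cname.get('href')]
        for div in aname.iterfind('div[@class=""]/div'):
            strong = div.find('strong')
            if strong is None:
                continue  # e.g. the row that only holds the image
            lines.append(strong.text)
            # tail is None when nothing follows the tag, so fall back to ''.
            text = (strong.tail or '').replace('\xa0', ' ').strip()
            if text:
                lines.append(text)
            else:
                lines.extend(box.text for box in div.iterfind('*[@class="box-span"]'))
        blocks.append(lines)
    return blocks


# Hypothetical, shortened version of the question's HTML with an &nbsp; added.
sample = '''<html><body>
<div class="aname">
    <div class="bname"><h5><a href="url_a0" class="cname">aTitle</a></h5></div>
    <div class="">
        <div><strong>label_a1:</strong>&nbsp;text_a1<br /></div>
        <div><strong>label_spe:<a href="url_a2">*</a>:</strong>
            <span class="box-span">spantext_a1</span>
            <span class="box-span">spantext_a2</span><br />
        </div>
    </div>
</div>
</body></html>'''

for block in extract(sample):
    print('\n'.join(block))
```

Replacing the entity with its numeric form keeps the character in the document instead of deleting it, and guarding against a `None` tail avoids having to drop the `strip()` call.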
-2
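import re

# Scan the raw markup line by line; this does not parse the HTML.
with open("input.txt", 'r') as file1:
    for line in file1:
        # Match runs of word characters (optionally followed by colons) between '>' and '<'.
        for ele in re.findall(r">\w*:*<", line):
            text = ele[1:-1]  # drop the surrounding '>' and '<'
            if text and text != ":":
                print(text)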

Output:

aTitle
img_text
label_a1:
text_a1
label_a2:
text_a2
label_spe:
spantext_a1
spantext_a2
spantext_a3
aTitle
img_text
label_a1:
text_b1
label_a2:
text_b2
label_a3:
text_b3
label_spe:
spantext_b1
spantext_b2
spantext_b3
spantext_b4
spantext_b5
spantext_b6
Sheshananda Naidu
  • WOW!! What beautiful code!! Many thanks!! It works perfectly! – Norman Weng Jan 09 '15 at 05:30
  • It's good to hear that. Please close the question. – Sheshananda Naidu Jan 09 '15 at 05:36
  • @NormanWeng: this code may be beautiful, but it is also fundamentally the wrong way to do things. Please read the link I posted in the above comment for why using regex to parse HTML is not a good plan...or read any of the dozens of blog posts about the same. Whoever maintains this code after you is not going to have nice things to say. – John Zwinck Jan 09 '15 at 07:38
  • @John Zwinck: Thanks for your great help. I have read the link you posted. As I mentioned, I am a newbie in the coding world and still have a lot to learn. Anything short and beautiful catches my eye very quickly. But I applied your way in my case, because the real HTML source is more complex. Anyway, thanks for your kind advice. – Norman Weng Jan 09 '15 at 07:46
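
To make the concern in the comments above concrete, here is a small sketch comparing the regex from this answer with a parser on a row whose text contains spaces. The `row` string is a hypothetical variation on the question's HTML, not taken from the asker's real data.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical row: same shape as in the question, but the text contains spaces.
row = '<div><strong>label a1:</strong>text a1<br /></div>'

# The regex from the answer above only accepts word characters between '>' and '<'.
regex_hits = [m[1:-1] for m in re.findall(r">\w*:*<", row)
              if m[1:-1] not in ("", ":")]

# A parser returns the text regardless of which characters it contains.
strong = ET.fromstring(row).find('strong')
parser_hits = [strong.text, strong.tail]

print(regex_hits)   # []
print(parser_hits)  # ['label a1:', 'text a1']
```

The posted output of the regex version also differs from what the question asked for: it prints img_text and never prints url_a0, because the pattern cannot tell rows apart or read attribute values.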