Use python LXML to extract information from html webpage

Question

I am trying to write a Python script to scrape specific information from a webpage, but my limited knowledge is not sufficient. I need to extract 7-8 pieces of information. The tags are as follows:

1

<a class="ui-magnifier-glass" href="here goes the link that i want to extract" data-spm-anchor-id="0.0.0.0" style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"></a>

2

<a href="link to extract" title="title to extract" rel="category tag" data-spm-anchor-id="0.0.0.0">or maybe this word instead of title</a>

If I get an idea of how to extract information from tags like these, I will be able to do the rest of the work myself.

Also, help with writing code to add this information to a CSV file would be highly appreciated.

I have started with this code

import requests
from lxml import html

url = raw_input('url : ')

page = requests.get(url)
tree = html.fromstring(page.text)
productname = tree.xpath('//h1[@class="product-name"]/text()')
price = tree.xpath('//span[@id="sku-discount-price"]/text()')
print '\n' + productname[0]
print '\n' + price[0]

Do you want the way of parsing using `Beautifulsoup` since you have tagged it here? I think parsing with `Beautifulsoup` is the easiest so far. — Sam Al-Ghammari, Jul 18 '15 at 02:01

sabertiger · Answer 1 · 2015-07-16T21:07:12.527

2

You can use the lxml and csv modules to do what you want. lxml supports XPath expressions to select the elements you want.

from lxml import etree
from StringIO import StringIO
from csv import DictWriter

f= StringIO('''
    <html><body>
    <a class="ui-magnifier-glass" 
       href="here goes the link that i want to extract" 
       data-spm-anchor-id="0.0.0.0" 
       style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"
    ></a>
    <a href="link to extract"
       title="title to extract" 
       rel="category tag" 
       data-spm-anchor-id="0.0.0.0"
    >or maybe this word instead of title</a>
    </body></html>
''')
doc = etree.parse(f)

data=[]
# Get all links with data-spm-anchor-id="0.0.0.0" 
r = doc.xpath('//a[@data-spm-anchor-id="0.0.0.0"]')

# Iterate through each matched <a> element
for elem in r:
    # You can access the attributes with get (returns None if missing)
    link=elem.get('href')
    title=elem.get('title')
    # and the text inside the tag is accessible with text
    text=elem.text

    data.append({
        'link': link,
        'title': title,
        'text': text
    })

with open('file.csv', 'wb') as csvfile:  # Python 2's csv module expects binary mode
    fieldnames=['link', 'title', 'text']
    writer = DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for row in data:
        writer.writerow(row)
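
The example above is Python 2 and parses a well-formed snippet with the XML parser. As a hedged sketch for Python 3 and real-world HTML, assuming lxml.html is used instead of etree.parse, the two tags from the question could be handled roughly like this. The URLs and title in SAMPLE are placeholders, and extract_links and rows_to_csv are illustrative helper names, not part of lxml:

```python
import csv
import io

from lxml import html

# Placeholder markup modelled on the two tags in the question.
SAMPLE = '''
<html><body>
<a class="ui-magnifier-glass" href="http://example.com/image" data-spm-anchor-id="0.0.0.0"></a>
<a href="http://example.com/category" title="Some title" rel="category tag">Link text</a>
</body></html>
'''

def extract_links(markup):
    """Return one dict per <a> element that carries an href attribute."""
    tree = html.fromstring(markup)  # tolerant HTML parser
    rows = []
    for a in tree.xpath('//a[@href]'):
        rows.append({
            'link': a.get('href'),
            'title': a.get('title') or '',   # get() returns None if missing
            'text': (a.text or '').strip(),  # .text is None for an empty tag
        })
    return rows

def rows_to_csv(rows):
    """Serialise the rows to CSV text: a header line plus one line per row."""
    buf = io.StringIO(newline='')
    writer = csv.DictWriter(buf, fieldnames=['link', 'title', 'text'])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

tree = html.fromstring(SAMPLE)
# Attribute values can also be selected directly in the XPath expression.
magnifier_href = tree.xpath('//a[@class="ui-magnifier-glass"]/@href')
category_title = tree.xpath('//a[@rel="category tag"]/@title')

rows = extract_links(SAMPLE)
print(magnifier_href, category_title)
print(rows_to_csv(rows))
```

To write to disk in Python 3, open the file with open('file.csv', 'w', newline='') and pass it to DictWriter in place of the StringIO buffer.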

edited Jul 16 '15 at 21:07

answered Jul 16 '15 at 20:41

sabertiger


Thank you so much! Is there a way I can take all the data in different variables, add them to a dictionary or a list, and then append it to the CSV? – Arushi Chopra Jul 16 '15 at 20:56
It's already doing so. I added more comments and refactored it for clarity. You should run this in the interactive Python interpreter if you haven't already done so. It lets you see what's going on line by line and examine the intermediate states. – sabertiger Jul 16 '15 at 21:11
Yes, I have run the code. But the problem is that it adds 3 rows of the same data to the CSV – Arushi Chopra Jul 16 '15 at 21:14
Maybe this is happening because data is a list and it is being used as a dictionary? – Arushi Chopra Jul 16 '15 at 21:18
The example looks for all a elements with attribute data-spm-anchor-id="0.0.0.0". Since there are two elements, there's a corresponding number of data rows. The first row is the header line that tells you what the columns contain, which can be omitted by removing writer.writeheader(). – sabertiger Jul 16 '15 at 21:20
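
To make the row count concrete, here is a small sketch (Python 3, standard library only) that writes two made-up rows to an in-memory buffer instead of file.csv. The values and the to_csv helper are illustrative assumptions, not taken from the real page:

```python
import csv
import io

# Two made-up rows, one per matched <a> element.
data = [
    {'link': 'first-link', 'title': '', 'text': ''},
    {'link': 'second-link', 'title': 'a title', 'text': 'some text'},
]

def to_csv(rows, header=True):
    """Write rows to CSV text, optionally preceded by a header line."""
    buf = io.StringIO(newline='')
    writer = csv.DictWriter(buf, fieldnames=['link', 'title', 'text'])
    if header:
        writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

with_header = to_csv(data).splitlines()
without_header = to_csv(data, header=False).splitlines()

# Header plus two data rows versus two data rows only.
print(len(with_header), len(without_header))
```

With the header the output has one more line than there are matched elements, which may be what looks like an extra row.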

score 0 · Answer 2 · edited May 23 '17 at 11:53

Here is how to extract an element by id using lxml, with the HTML piped in from curl:

curl some.html | python extract.py

extract.py:

from lxml import etree
import sys
# grab all elements with id == 'postingbody'
pb = etree.HTML(sys.stdin.read()).xpath("//*[@id='postingbody']")
print(pb)

some.html:

<html>
    <body>
        <div id="nope">nope</div>
        <div id="postingbody">yep</div>
    </body>
</html>
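
Note that print(pb) shows the matched Element objects rather than their contents. As a sketch, assuming the same markup as some.html but inlined as a string so it runs without curl, the text can be read from each element's .text or selected directly with text():

```python
from lxml import etree

# Same markup as some.html, inlined for a self-contained run.
SOME_HTML = '''
<html>
    <body>
        <div id="nope">nope</div>
        <div id="postingbody">yep</div>
    </body>
</html>
'''

root = etree.HTML(SOME_HTML)

# Elements whose id attribute equals 'postingbody'
elements = root.xpath("//*[@id='postingbody']")
texts_from_elements = [el.text for el in elements]

# Same selection, returning the text nodes directly
texts_from_xpath = root.xpath("//*[@id='postingbody']/text()")

print(texts_from_elements, texts_from_xpath)
```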

Also see:

XPath to select Element by attribute value
