lxml: split at attribute?

Question

I'm using lxml to scrape some HTML that looks like this:

<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>

How can I end up with data in the form

[ {'category': 'Football', 'title': 'Team A'},
{'category': 'Football', 'title': 'Team B'},
{'category': 'Baseball', 'title': 'Team C'},
{'category': 'Baseball', 'title': 'Team D'}]

So far I've got:

results = []
for a in content[0].xpath('./a'):
    # a fresh dict per link; the category is still missing here
    data = {'title': a.text}
    results.append(data)

But I don't know how to get the category name by splitting at font-size and retaining sibling tags - any advice?

Thanks!

I am not sure what data you are missing - the result seems ok to me. — miku, Jun 13 '11 at 12:46
sorry, missed that *How can* in *I end up with data in the form* ... — miku, Jun 13 '11 at 12:50
Sanity check: do you control the HTML enough to be _sure_ it will be proper XML? — Foon, Jun 13 '11 at 13:12

miku · Accepted Answer · 2011-06-13T13:39:08.457

I had success with the following code:

#!/usr/bin/env python

snippet = """
<html><head></head><body>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</body></html>
"""

import lxml.html

html = lxml.html.fromstring(snippet)
body = html[1]

results = []
current_category = None

for element in body.xpath('./*'):
    if element.tag == 'div':
        current_category = element.xpath('./a')[0].text
    elif element.tag == 'a':
        results.append({'category': current_category,
                        'title': element.text})

print(results)

It will print:

[{'category': 'Football', 'title': 'Team A'}, 
 {'category': 'Football', 'title': 'Team B'}, 
 {'category': 'Baseball', 'title': 'Team C'}, 
 {'category': 'Baseball', 'title': 'Team D'}]

Scraping is fragile. Here for example, we depend explicitly on the ordering of the elements as well as the nesting. However, sometimes such a hardwired approach might be good enough.
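One way to make the loop a little less brittle is to recognise a category header by its style attribute instead of by tag name alone, and to skip links that appear before any header. The following is only a sketch under assumptions: the helper name `category_name`, the extra "Orphan" link and the spacer div are made up for illustration, and the real page may mark its headers differently.

```python
import lxml.html

# Sample markup with two hypothetical irregularities: a link before any
# header, and a <div> that is not a category header.
snippet = """
<html><head></head><body>
<a href="">Orphan</a>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<div align=center>Just a spacer</div>
<a href="">Team B</a>
</body></html>
"""


def category_name(element):
    """Return the header text if element looks like a category header, else None.

    Assumption: a header is a <div> with an <a> child whose style attribute
    mentions font-size.
    """
    if element.tag != 'div':
        return None
    links = element.xpath('./a[contains(@style, "font-size")]')
    return links[0].text_content().strip() if links else None


body = lxml.html.fromstring(snippet).body

results = []
current_category = None
for element in body:
    name = category_name(element)
    if name is not None:
        current_category = name
    elif element.tag == 'a' and current_category is not None:
        results.append({'category': current_category,
                        'title': element.text_content().strip()})

print(results)
```

With this sample, the orphan link and the spacer div are ignored and both teams end up under Football.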

Here is another (more XPath-oriented) approach using the preceding-sibling axis:

#!/usr/bin/env python

snippet = """
<html><head></head><body>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</body></html>
"""

import lxml.html

html = lxml.html.fromstring(snippet)
body = html[1]

results = []

for e in body.xpath('./a'):
    results.append(dict(
        category=e.xpath('preceding-sibling::div/a')[-1].text,
        title=e.text))

print(results)
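A small variation on the preceding-sibling idea, sketched here in Python 3: in XPath 1.0 a positional predicate on a reverse axis counts backwards from the context node, so `preceding-sibling::div[1]` selects the nearest preceding div directly and the `[-1]` index on the Python side is no longer needed. The guard for links without any preceding header is an addition for illustration, not part of the code above.

```python
import lxml.html

snippet = """
<html><head></head><body>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</body></html>
"""

body = lxml.html.fromstring(snippet).body

results = []
for link in body.xpath('./a'):
    # On a reverse axis, [1] means "closest to the context node".
    headers = link.xpath('preceding-sibling::div[1]/a')
    if not headers:
        continue  # no category header before this link
    results.append({'category': headers[0].text, 'title': link.text})

print(results)
```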

Genius, thanks. Yes, on my actual page `preceding-sibling` works better! — Richard, Jun 13 '11 at 15:52
I now realise my error: trying to work out what to do from the lxml documentation, not the xpath documentation! — Richard, Jun 13 '11 at 15:52

score 1 · Answer 2 · answered Jun 13 '11 at 14:00

If you are looking for another way to do this (just an option, so don't beat me up too much), or you are not able to import lxml, you can use the following admittedly weird code:

text = """
            <a href="">Team YYY</a>
            <div align=center><a style="font-size: 1.1em">Polo</a></div>
            <div align=center><a style="font-size: 1.1em">Football</a></div>
            <a href="">Team A</a>
            <a href="">Team B</a>
            <div align=center><a style="font-size: 1.1em">Baseball</a></div>
            <a href="">Team C</a>
            <a href="">Team D</a>
            <a href="">Team X</a>
            <div align=center><a style="font-size: 1.1em">Tennis</a></div>
        """
# next variables could be modified depending on what you really need        
keyStartsWith = '<div align=center><a style="font-size: 1.1em">'
categoryStart = len(keyStartsWith)
categoryEnd = -len('</a></div>')
output = []
data = text.split('\n')    
titleStart = len('<a href="">')
titleEnd = -len('</a>')

getdict = lambda category, title: {'category': category, 'title': title}

# main loop
for i, line in enumerate(data):
    line = line.strip()
    if keyStartsWith in line and len(data)-1 >= i+1:
        category = line[categoryStart: categoryEnd]
        (len(data)-1 == i and output.append(getdict(category, '')))
        if i+1 < len(data)-1 and keyStartsWith in data[i+1]:
            output.append(getdict(category, ''))
        else:
            while i+1 < len(data)-1 and keyStartsWith not in data[i+1]:
                title = data[i+1].strip()[titleStart: titleEnd]
                output.append(getdict(category, title))
                i += 1
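For the case where lxml cannot be imported, a standard-library parser may be easier to follow than slicing strings by hand. This is a rough sketch using `html.parser` from Python 3, not a drop-in replacement for the code above: the class name `TeamParser` is made up, it treats any link inside a div as a category header, and unlike the loop above it emits nothing for categories that have no teams.

```python
from html.parser import HTMLParser

text = """
<a href="">Team YYY</a>
<div align=center><a style="font-size: 1.1em">Polo</a></div>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
<a href="">Team X</a>
<div align=center><a style="font-size: 1.1em">Tennis</a></div>
"""


class TeamParser(HTMLParser):
    """Collect {'category', 'title'} dicts; hypothetical helper for this sketch."""

    def __init__(self):
        super().__init__()
        self.results = []
        self.category = None   # most recent header text seen so far
        self.div_depth = 0     # > 0 while inside a <div>
        self.in_link = False
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.div_depth += 1
        elif tag == 'a':
            self.in_link = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_link:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'div':
            self.div_depth = max(0, self.div_depth - 1)
        elif tag == 'a' and self.in_link:
            self.in_link = False
            link_text = ''.join(self.buffer).strip()
            if self.div_depth > 0:
                # Assumption: a link inside a <div> is a category header.
                self.category = link_text
            elif self.category is not None:
                self.results.append({'category': self.category,
                                     'title': link_text})


parser = TeamParser()
parser.feed(text)
parser.close()
results = parser.results
print(results)
```

Links that appear before the first header (Team YYY here) are skipped, since there is no category to attach them to.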

No offense - this might be correct, but it's way too complicated. — miku, Jun 13 '11 at 14:02
@miku - yes, I know, your solution is simpler, and that is why I voted for it. I simply put my solution here as an option for those who are not able to use yours for some local reason. — Artsiom Rudzenka, Jun 13 '11 at 14:07
sure, and I won't downvote. But in general if you try to do anything similar to parsing HTML, you should work with a dedicated library - people even try to parse HTML with regexes and then funny things happen - see: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — miku, Jun 13 '11 at 14:10
@miku - I completely agree with you. I also believe that HTML vs. regex is a battle between evil and heaven) About the SO thread - it is probably the most popular thread in our company) — Artsiom Rudzenka, Jun 13 '11 at 14:18
