How do I get the link and title from this (part of) html string in python

Question

I'm writing a plugin for xbmc in python. I have got a list of strings in the format:
<a href="/www.link.to/something">name of link</a>

I get them using BeautifulStoneSoup (the relevant part of the code):

    soup = BeautifulStoneSoup(link, convertEntities=BeautifulStoneSoup.XML_ENTITIES)
    programs = soup('ul')
    i = 0
    for prog in programs:
        i = i+1
        if i==(5+getLetterValue(name)):
            j = 0
            while j < len(prog('li')):
                li = prog('li')[j]
                link = li('a')[0]

getLetterValue is a function that returns an index which indicates where this specific 'ul' tag is placed (according to the desired letter).

Now I want to split link into its URL and its text. I tried using re.compile:
match=re.compile('<a href="(.+?)">(.+?)</a>').findall(link.string)
but all I get is match=[]

What have I done wrong?

Note: I know I shouldn't use regexps on HTML, but I'm not sure this "rule" applies to such a small string. Also, for some reason this is almost standard practice in xbmc plugin writing, and I assume there is some reason for that.

If **link.string** is like **<a href="/www.link.to/something">name of link</a>**, the regex pattern is correct and should match it. But don't name an object 'match': I don't think you override **re**'s **match** method that way, but it's a risky habit. – eyquem, Aug 28 '11 at 20:06
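
An empty result would be consistent with link.string holding only the text inside the tag rather than the tag's markup. A minimal sketch using the pattern from the question; `markup` and `text_only` are hypothetical stand-ins, not values taken from the real page:

```python
import re

# Pattern from the question, stored under a name that does not
# clash with re.match.
anchor_re = re.compile('<a href="(.+?)">(.+?)</a>')

# Hypothetical stand-ins: the tag's full markup, and what link.string
# would hold if it is only the text inside the tag.
markup = '<a href="/www.link.to/something">name of link</a>'
text_only = 'name of link'

pairs_from_markup = anchor_re.findall(markup)
pairs_from_text = anchor_re.findall(text_only)

print(pairs_from_markup)  # [('/www.link.to/something', 'name of link')]
print(pairs_from_text)    # []
```

If that is what is happening here, passing the tag's markup (for example `str(link)`) instead of `link.string` should make the pattern match.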

Ross Patterson · Accepted Answer · 2011-08-29T07:27:30.513

2

Why not let BeautifulSoup give you the href attribute and the element contents?
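
For illustration, a minimal sketch of that approach. It assumes the newer `bs4` package with its `html.parser` backend rather than the `BeautifulStoneSoup` class used in the question, and the `html` string is a made-up stand-in for the real page:

```python
# Assumption: the bs4 package is installed; the question itself uses
# the older BeautifulStoneSoup class.
from bs4 import BeautifulSoup

# Hypothetical stand-in for one list item of the real page.
html = '<ul><li><a href="/www.link.to/something">name of link</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')

link = soup.find('a')
href = link['href']       # tag attributes are read like dictionary keys
title = link.string       # the tag's single text child
pieces = link.contents    # list of all children of the tag

print(href)
print(title)
```

No regex is needed: the parser already knows which part is the attribute and which part is the text.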

edited Aug 29 '11 at 07:27

answered Aug 28 '11 at 20:02

Great tool. However, I still need the string, ``name of link`` in my question. – Yotam Aug 29 '11 at 05:51
That's also in the *same docs*. Edited the answer with a paste from the *docs*. – Ross Patterson Aug 29 '11 at 07:27
I found out about contents about 15 minutes before you answered me, thanks. I still have a problem though. I think it has something to do with the Hebrew on the webpage. The answer I get is in the format of [u'\u50e0...'] and I can't figure out how to convert that to a unicode string. – Yotam Aug 29 '11 at 09:00
Nope, it didn't work. I toyed around with it and I couldn't convert this into Hebrew. I'll ask a new question. – Yotam Aug 29 '11 at 19:33
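
On the `[u'\u...']` output: that form is how Python 2 displays a list, because each item is shown through its repr, which escapes non-ASCII characters. The item inside is presumably already a unicode string, so this may be a display issue rather than a conversion problem. A small sketch, where the list is a made-up stand-in and not the actual data from the page:

```python
# Hypothetical stand-in for the list returned by .contents; the four
# escapes spell the Hebrew word "shalom".
contents = [u'\u05e9\u05dc\u05d5\u05dd']

title = contents[0]          # index into the list to get the string itself
joined = u''.join(contents)  # or join all text pieces into one string

print(len(title))            # counts characters, not escape sequences
```

Whether the characters then show up correctly depends on the encoding of wherever the string is printed or displayed, which this sketch does not cover.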

score 0 · Answer 2 · answered Aug 28 '11 at 20:01

The easiest way is to use lxml:

from lxml.html import fromstring

# Parse the tag's markup; link.string would hold only the tag's text.
elem = fromstring(str(link))
print(elem.attrib["href"])  # the URL
print(elem.text)            # the link text

Gabi Purcaru

**lxml** is slower than BeautifulSoup, which is itself slower than pure regex. One time I measured **lxml** being 100 times slower than code using only regexes. – eyquem Aug 28 '11 at 20:45
