3

I'm writing a plugin for xbmc in python. I have got a list of strings in the format:
<a href="/www.link.to/something">name of link</a>

By using beautiful stone soup (the relevant part of the code):

 soup = BeautifulStoneSoup(link, convertEntities=BeautifulStoneSoup.XML_ENTITIES)
    programs = soup('ul')
    i = 0
    for prog in programs:
        i = i+1
        if i==(5+getLetterValue(name)):
            j = 0
            while j < len(prog('li')):
                li = prog('li')[j]
                link = li('a')[0]

getLeterValue is a function that returns an index which indidcates where this specific 'ul' tag is placed (according to the desired letter).

Now I want to split link in the link and text. I tried using re.compile:
match=re.compile('<a href="(.+?)">(.+?)</a>').findall(link.string)
but all I get is match=[]

What have I done wrong?

Note: I know I should regexp html code but I'm not sure this ``rule'' is valid for small string. Also, for some reason this is almost a standard in xbmc plugin writing and I assume there is some reason for that.

Yotam
  • 10,295
  • 30
  • 88
  • 128
  • If **link.string** is like **name of link** , the regex's pattern is correct to match them. But don't call an object with the identifier 'match', I don't think that you override the **re**'s method **match**, but that's dangerous – eyquem Aug 28 '11 at 20:06
  • You should use ``for i,prog in enumerate(programs):`` – eyquem Aug 28 '11 at 20:14

2 Answers2

2

Why not let BeautifulSoup give you the href attribute and the element contents?

Ross Patterson
  • 5,702
  • 20
  • 38
  • Great tool. However, I still need the string, ``name of link`` in my question. – Yotam Aug 29 '11 at 05:51
  • That's also in the *same docs*. Edited the answer with a paste from the *docs*. – Ross Patterson Aug 29 '11 at 07:27
  • I have found about contents about 15 minutes before you answered me, thanks. I still have a problem though. I think it has something to do with the Hebrew with the webpage. The answer I get is in the format of [u'\u50e0...'] and I can't figure how to convert that to a unicode string. – Yotam Aug 29 '11 at 09:00
  • Nope, it didn't work. I toyed around with it and I couldn't have convert this into Hebrew. I'll ask a new question – Yotam Aug 29 '11 at 19:33
0

The easiest way is to use lxml:

from lxml.html import fromstring

elem = fromstring(link.string)
print elem.attrib["href"]
print elem.text
Gabi Purcaru
  • 30,940
  • 9
  • 79
  • 95
  • **lxml** is slower than BeautifulSoup , which is itself slower than pure regex. One time I measured **lxml** being 100 times slower than a code using uniquely regexes. – eyquem Aug 28 '11 at 20:45