
I am doing a small project where I extract occurrences of political leaders in newspapers. Sometimes a politician will be mentioned without either a parent or a child element containing a link (due, I guess, to semantically poor markup).

So I want to create a function that can find the nearest link and extract it. In the example below the search string is Rasmussen, and the link I want is /307046.

#-*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import re

tekst = '''
<li>
  <div class="views-field-field-webrubrik-value">
    <h3>
      <a href="/307046">Claus Hjort spiller med mrkede kort</a>
    </h3>
  </div>
  <div class="views-field-field-skribent-uid">
    <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
  </div>
  <div class="views-field-field-webteaser-value">
    <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
      trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
      snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
      genkomst som statsministe
    </div>
  </div>
  <span class="views-field-view-node">
    <span class="actions">
      <a href="/307046">Ls mere</a>
      |
      <a href="/307046/#comments">Kommentarer (4)</a>
    </span>
  </span>
</li>
'''

to_find = "Rasmussen"
soup = BeautifulSoup(tekst)
contexts = soup.find_all(text=re.compile(to_find)) 

def find_nearest(element, url, direction="both"):
    """Find the nearest link, relative to a text string.
    When complete it will search up and down (parent, child),
    and only X levels up down. These features are not implemented yet.
    Will then return the link the fewest steps away from the
    original element. Assumes we have already found an element"""

    # Is the nearest link readily available?
    # If so - this works and extracts the link.
    if element.find_parents('a'):
        for artikel_link in element.find_parents('a'):
            link = artikel_link.get('href')
            # sometimes the link is a relative link - sometimes it is not
            if ("http" or "www") not in link:
                link = url+link
                return link
    # But if the link is not readily available, we will go up
    # This is (I think) where it goes wrong
    # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
    if not element.find_parents('a'):
        element = element.parent
        # Print for debugging
        print element  # on the 2nd run (i.e. <li>) this finds <a href="/307046">
        # So shouldn't it be caught as readily available above?
        print u"Found: %s" % element.name
        # the recursive call
        find_nearest(element,url)

# run it
if contexts:
    for a in contexts:
        find_nearest(element=a, url="http://information.dk")

The direct call below works: from the matched string, three `.parent` steps climb to the `<li>`, and `.a` then grabs its first `<a>` descendant:

print contexts[0].parent.parent.parent.a['href'].encode('utf-8')

For reference the whole sorry code is on bitbucket: https://bitbucket.org/achristoffersen/politikere-i-medierne

(P.S. Using BeautifulSoup 4.)


EDIT: SimonSapin asks me to define nearest: by nearest I mean the link that is the fewest nesting levels away from the search term, in either direction. In the text above, the `a href` produced by the Drupal-based newspaper site is neither a direct parent nor a child of the tag where the search string is found, so BeautifulSoup can't find it.

I suspect a 'fewest characters away' measure would often work too. In that case a solution could be hacked together with `find` and `rfind` - but I would really like to do this via BS. Since `contexts[0].parent.parent.parent.a['href'].encode('utf-8')` works, it must be possible to generalise it into a script.
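To illustrate, that `find`/`rfind` hack would look roughly like this (just a rough sketch of mine; `nearest_href_by_chars` is a throwaway name, and it measures distance in raw characters rather than nesting levels):

import re

def nearest_href_by_chars(html, term):
    """Return the href that sits the fewest characters away from
    the first occurrence of `term` in the raw HTML string."""
    pos = html.find(term)
    if pos == -1:
        return None
    candidates = []
    for m in re.finditer(r'href="([^"]+)"', html):
        # distance from the search term to the nearest edge of this href
        distance = min(abs(m.start() - pos), abs(m.end() - pos))
        candidates.append((distance, m.group(1)))
    return min(candidates)[1] if candidates else None

print nearest_href_by_chars(tekst, "Rasmussen")  # -> /307046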

EDIT: Maybe I should emphasize that I'm looking for a BeautifulSoup solution. Combining BS with a custom/simple breadth-first search, as suggested by @erik85, would quickly become messy, I think.

Andreas
  • How do you define "near", and what is your question? – Simon Sapin Aug 03 '12 at 08:49
  • And my question is: how do I extract this link / what's wrong with my code? Thanks – Andreas Aug 03 '12 at 14:00
  • Hmmm - I really don't know how I should award the bounty. unutbu has given me roundrobin, and erikb85 seems to have struck a chord with fellow coders (11 upvotes). However, neither has answered what is specifically wrong with my code. unutbu comes closest, I guess, and since he has not been awarded any upvotes, I think I'll give him the bounty... Will decide tonight, Danish time - an hour before the bounty ends. – Andreas Aug 10 '12 at 12:36

2 Answers


Someone will probably come up with a solution that works with copy&paste, and you will think that it solves your problem. Your problem isn't the code, though: it's your strategy. There is a software design principle called "divide and conquer" which you should apply in a redesign of your code: separate the code that parses your HTML strings into trees/graphs from the search for the nearest node (probably a breadth-first search). Not only will you learn to design better software, your problem will probably just cease to exist.

I think you are smart enough to solve this yourself, yet I also want to provide a skeleton:

def parse_html(txt):
    """ reads a string of html and returns a dict/list/tuple presentation"""
    pass

def breadth_first_search(graph, start, end):
    """ finds the shortest way from start to end
    You can probably customize start and end to work well with the input you want
    to provide. For implementation details see the link in the text above.
    """
    pass

def find_nearest_link(html, name):
    """putting it all together"""
    return breadth_first_search(parse_html(html), name, "link")

PS: Doing this also applies another principle, but from mathematics: Assuming there is a problem you don't know a solution to (finding links close to a chosen substring) and there are a group of problems that you know a solution to (graph traversal), then try to transform your problem to match the group of problems you can solve, so you can just use basic solution patterns (that are probably even implemented already in a language/framework of your choice) and you are done.
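If you would rather stay inside BeautifulSoup than write your own parser, the breadth-first part might look like this minimal sketch (my illustration, assuming bs4; `tekst` is the HTML string from the question, `bfs_nearest_link` is my placeholder name, and a node's neighbours are taken to be its parent plus its children):

from collections import deque
import re

from bs4 import BeautifulSoup

def bfs_nearest_link(element):
    """Breadth-first search outward from `element`; returns the href of
    the first <a> tag reached, i.e. the one fewest nesting levels away."""
    seen = set()
    queue = deque([element])
    while queue:
        node = queue.popleft()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if getattr(node, 'name', None) == 'a' and node.get('href'):
            return node['href']
        # Neighbours: one step up (parent) and one step down (children).
        if node.parent is not None:
            queue.append(node.parent)
        queue.extend(getattr(node, 'contents', []))
    return None

soup = BeautifulSoup(tekst, 'html.parser')
hit = soup.find(text=re.compile("Rasmussen"))
print(bfs_nearest_link(hit))  # -> /307046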

erikbstack
  • +1 very good conceptual answer for someone who probably needs the principle more than the code. – msw Aug 04 '12 at 13:32
  • Thanks. While I appreciate your answer a lot, I also think that it misses the mark a tiny bit: you are right that I should (and will) look at the 'divide and conquer' principle (e.g. I seem to remember from Learn Python the Hard Way that more than two nesting levels is not good). But I would like to use BS here, and I don't think your answer helps me understand why I can't generalise `contexts[0].parent.parent.parent.a['href']`. However, all good advice deserves an upvote at the very least :-) - p.s. I haven't read the wiki article yet... – Andreas Aug 04 '12 at 13:46

Here is a solution using lxml. The main idea is to find all the preceding and following elements and then do a roundrobin iteration through those elements:

def find_nearest(elt):
    preceding = elt.xpath('preceding::*/@href')[::-1]
    following = elt.xpath('following::*/@href')
    parent = elt.xpath('parent::*/@href')
    for href in roundrobin(parent, preceding, following):
        return href

A similar solution using BeautifulSoup's (or bs4's) `next_elements` and `previous_elements` should also be possible; a sketch of that variant follows the full code and output below.


import lxml.html as LH
import itertools

def find_nearest(elt):
    preceding = elt.xpath('preceding::*/@href')[::-1]
    following = elt.xpath('following::*/@href')
    parent = elt.xpath('parent::*/@href')
    for href in roundrobin(parent, preceding, following):
        return href

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # http://docs.python.org/library/itertools.html#recipes
    # Author: George Sakkis
    pending = len(iterables)
    nexts = itertools.cycle(iter(it).next for it in iterables)
    while pending:
        try:
            for n in nexts:
                yield n()
        except StopIteration:
            pending -= 1
            nexts = itertools.cycle(itertools.islice(nexts, pending))

tekst = '''
<li>
  <div class="views-field-field-webrubrik-value">
    <h3>
      <a href="/307046">Claus Hjort spiller med mrkede kort</a>
    </h3>
  </div>
  <div class="views-field-field-skribent-uid">
    <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
  </div>
  <div class="views-field-field-webteaser-value">
    <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
      trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
      snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
      genkomst som statsministe
    </div>
  </div>
  <span class="views-field-view-node">
    <span class="actions">
      <a href="/307046">Ls mere</a>
      |
      <a href="/307046/#comments">Kommentarer (4)</a>
    </span>
  </span>
</li>
'''

to_find = "Rasmussen"
doc = LH.fromstring(tekst)

for x in doc.xpath('//*[contains(text(),{s!r})]'.format(s = to_find)):
    print(find_nearest(x))

yields

/307046
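
And a sketch of that bs4 variant (my illustration, not tested beyond this example; `find_nearest_bs4` is a placeholder name, and it restricts the search to <a> tags, which is all this markup contains anyway). It reuses the `roundrobin`, `tekst` and `to_find` definitions from the listing above:

from bs4 import BeautifulSoup
import re

def find_nearest_bs4(element):
    # find_all_previous/find_all_next return tags nearest-first, so no
    # reversal is needed, unlike the preceding:: axis above
    parent = [a['href'] for a in element.find_parents('a', href=True)]
    preceding = [a['href'] for a in element.find_all_previous('a', href=True)]
    following = [a['href'] for a in element.find_all_next('a', href=True)]
    for href in roundrobin(parent, preceding, following):
        return href

soup = BeautifulSoup(tekst, 'html.parser')
for hit in soup.find_all(text=re.compile(to_find)):
    print(find_nearest_bs4(hit))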
unutbu
  • Thanks - so this finds the nearest element. What if there is a match in both directions 3 iterations down? Will both links be returned? – Andreas Aug 10 '12 at 12:39
  • It will return only the first nearest element. The `roundrobin` iterator yields an element from `parent`, then `preceding`, then `following`. So if there is more than one nearest element, the one in `parent` has the highest precedence, then the one in `preceding`, and then the one in `following`. – unutbu Aug 10 '12 at 16:55