I am doing a small project where I extract occurrences of political leaders in newspapers. Sometimes a politician is mentioned, and there is neither a parent nor a child element with a link (due, I guess, to semantically bad markup).
So I want to create a function that finds the nearest link and extracts it. In the case below the search string is Rasmussen, and the link I want is /307046.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re
tekst = '''
<li>
<div class="views-field-field-webrubrik-value">
<h3>
<a href="/307046">Claus Hjort spiller med mærkede kort</a>
</h3>
</div>
<div class="views-field-field-skribent-uid">
<div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
</div>
<div class="views-field-field-webteaser-value">
<div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
genkomst som statsministe
</div>
</div>
<span class="views-field-view-node">
<span class="actions">
<a href="/307046">Læs mere</a>
|
<a href="/307046/#comments">Kommentarer (4)</a>
</span>
</span>
</li>
'''
to_find = "Rasmussen"
soup = BeautifulSoup(tekst, "html.parser")  # name the parser explicitly
contexts = soup.find_all(text=re.compile(to_find))
def find_nearest(element, url, direction="both"):
    """Find the nearest link, relative to a text string.
    When complete it will search up and down (parent, child),
    and only X levels up/down. These features are not implemented yet.
    Will then return the link the fewest steps away from the
    original element. Assumes we have already found an element."""
    # Is the nearest link readily available?
    # If so - this works and extracts the link.
    for artikel_link in element.find_parents('a'):
        link = artikel_link.get('href')
        # Sometimes the link is relative, sometimes it is not.
        # Note: ("http" or "www") evaluates to just "http", so the
        # original test silently ignored "www" - check both explicitly.
        if "http" not in link and "www" not in link:
            link = url + link
        return link
    # But if the link is not readily available, we will go up.
    # This is (I think) where it goes wrong.
    if element.parent is None:
        return None  # reached the top of the tree without finding a link
    element = element.parent
    # Print for debugging
    print(element)  # on the 2nd run (i.e. <li>) this prints <a href="/307046">
    # So shouldn't it be caught as readily available above?
    print(u"Found: %s" % element.name)
    # The recursive call - its result must be returned,
    # otherwise the caller always gets None.
    return find_nearest(element, url)

# run it
if contexts:
    for a in contexts:
        print(find_nearest(element=a, url="http://information.dk"))
The direct call below works:
print(contexts[0].parent.parent.parent.a['href'].encode('utf-8'))
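That hard-coded chain of `.parent` calls can be generalised by climbing one level at a time and checking each ancestor for a link. A minimal sketch of that idea - the helper name `nearest_link` is my own, and the one-line HTML fragment is just a stand-in for the markup above:

```python
import re
from bs4 import BeautifulSoup

def nearest_link(element):
    # Climb from the matched text node towards the root; at each
    # ancestor, take the first <a> with an href among its descendants.
    node = element.parent
    while node is not None:
        link = node.find('a', href=True)
        if link is not None:
            return link['href']
        node = node.parent
    return None

html = ('<li><h3><a href="/307046">Claus Hjort</a></h3>'
        '<div class="webteaser">... Rasmussens genkomst ...</div></li>')
soup = BeautifulSoup(html, 'html.parser')
hit = soup.find(text=re.compile('Rasmussen'))
print(nearest_link(hit))  # /307046
```

This finds the link whose *common ancestor* with the search hit is the fewest levels up, which matches the `parent.parent.parent.a` call for the markup above.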
For reference the whole sorry code is on bitbucket: https://bitbucket.org/achristoffersen/politikere-i-medierne
(P.S. Using BeautifulSoup 4.)
EDIT: SimonSapin asks me to define nearest: By nearest I mean the link that is the fewest nesting levels away from the search term, in either direction. In the text above, the a href
produced by the Drupal-based newspaper site is neither a direct parent nor a child of the tag where the search string is found, so BeautifulSoup can't find it.
I suspect a 'fewest characters away' metric would often work too. In that case a solution could be hacked together with find and rfind - but I would really like to do this via BS. Since this works: contexts[0].parent.parent.parent.a['href'].encode('utf-8')
it must be possible to generalise it to a script.
EDIT: Maybe I should emphasize that I am looking for a BeautifulSoup solution. Combining BS with a custom/simple breadth-first search as suggested by @erik85 would quickly become messy, I think.
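For reference, a breadth-first search that matches the "fewest nesting levels, in either direction" definition can still be expressed purely in terms of BeautifulSoup's own `.parent`/`.children` navigation, without a separate graph structure. A sketch under that assumption (the function name `bfs_nearest_link` and the sample fragment are mine):

```python
import re
from collections import deque
from bs4 import BeautifulSoup, Tag

def bfs_nearest_link(start):
    # Breadth-first search where a node's neighbours are its parent
    # and its children: the first <a href> dequeued is the one the
    # fewest nesting levels away from the starting text node.
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if isinstance(node, Tag):
            if node.name == 'a' and node.get('href'):
                return node['href']
            queue.extend(node.children)  # NavigableStrings have no children
        if node.parent is not None:
            queue.append(node.parent)
    return None

html = ('<li><h3><a href="/307046">Claus Hjort</a></h3>'
        '<div class="webteaser">... Rasmussens genkomst ...</div></li>')
soup = BeautifulSoup(html, 'html.parser')
hit = soup.find(text=re.compile('Rasmussen'))
print(bfs_nearest_link(hit))  # /307046
```

The `seen` set (keyed by `id()`, since tags compare by content) stops the search from bouncing between a parent and its children forever.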