
I'm building a web crawler in Python, using Beautiful Soup, to crawl Wikipedia. The problem is that Wikipedia pages contain a lot of garbage links that I don't want to follow.

For example:

target links, where the href is a # followed by the name of a section on the same page

<li class="toclevel-1 tocsection-1">
  <a href="#Overview">
    <span class="tocnumber">1</span>
    <span class="toctext">Overview</span>
  </a>
</li>

talk pages

<li class="nv-talk">
  <a href="/wiki/Template_talk:Data_structures" title="Template talk:Data structures">
    <span title="Discuss this template" style=";;background:none transparent;border:none;;">t</span>
  </a>
</li>

template pages

<li class="nv-view">
  <a href="/wiki/Template:Data_structures" title="Template:Data structures">
    <span title="View this template" style=";;background:none transparent;border:none;;">v</span>
  </a>
</li>

and so on...

I'm storing every link I've already visited in a dictionary so I don't visit it twice. That lets me skip the # target links by checking whether the part of the link before the # symbol is already in the dictionary.
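As a rough sketch of that check (not the asker's actual code), the fragment can be stripped with the standard library's urllib.parse.urldefrag before the lookup. The names normalize, should_visit and visited are hypothetical, and a set is used here in place of the dictionary mentioned above.

```python
from urllib.parse import urldefrag

# Hypothetical container for links seen so far; a set stands in for the
# dictionary described in the question.
visited = set()

def normalize(href):
    """Return href with any '#fragment' part removed."""
    url, _fragment = urldefrag(href)
    return url

def should_visit(href):
    """Return True the first time a fragment-free URL is seen, and record it."""
    url = normalize(href)
    # An empty URL means the link was only '#Section', i.e. the current page.
    if not url or url in visited:
        return False
    visited.add(url)
    return True
```

With this, "/wiki/Linked_list" and "/wiki/Linked_list#Overview" count as the same page, and a bare "#Overview" is never queued.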

I'm having a little more trouble with talk, template, and other such pages, however.

Something unique about them is that they always appear within an <li> tag that has some class attribute ("nv-talk", "nv-view", etc.). However, my crawler works by looking at the <a> tags, so I don't have access to the attributes of the <li> tag that contains each link.

Furthermore, not all links on a page are contained within an <li> tag, so I can't simply search for <li> tags instead.

Any ideas?

martin-martin
Kittenmittons

1 Answer


You can use the find_parents() method of BeautifulSoup. It tells you whether a particular tag is nested inside another tag with the specified attributes. In this case we are looking for an anchor tag nested inside a tag whose class attribute is nv-talk or nv-view.

Demo:

from bs4 import BeautifulSoup

html = '''<li class="nv-talk"><a href="/wiki/Template_talk:Data_structures" title="Template talk:Data structures"><span title="Discuss this template" style=";;background:none transparent;border:none;;">t</span></a></li>'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
a_tag = soup.find('a')
# Returns every ancestor of a_tag whose class is nv-talk
a_tag.find_parents(attrs={'class': 'nv-talk'})

which gives you:

[<li class="nv-talk"><a href="/wiki/Template_talk:Data_structures" title="Template talk:Data structures"><span style=";;background:none transparent;border:none;;" title="Discuss this template">t</span></a></li>]

For every anchor tag you collect, you can check whether find_parents() returns an empty list. If it does, the link is not inside one of those talk or template list items, so it is safe to crawl.
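Applied to a whole page, that check might look something like the sketch below. The function name crawlable_links and the sample markup are made up for illustration, and the pattern assumes the unwanted wrappers all have a class starting with "nv-", as in the nv-talk and nv-view examples from the question.

```python
import re
from bs4 import BeautifulSoup

# Assumption: the unwanted wrappers carry a class starting with "nv-"
# (nv-talk, nv-view, ...), as in the question's examples.
NAV_CLASS = re.compile(r'^nv-')

def crawlable_links(html):
    """Return hrefs of <a> tags that have no ancestor with an nv-* class."""
    soup = BeautifulSoup(html, 'html.parser')
    hrefs = []
    for a_tag in soup.find_all('a', href=True):
        # An empty list means no ancestor matched, so the link is kept.
        if not a_tag.find_parents(attrs={'class': NAV_CLASS}):
            hrefs.append(a_tag['href'])
    return hrefs

sample = '''
<li class="nv-talk"><a href="/wiki/Template_talk:Data_structures">t</a></li>
<li class="nv-view"><a href="/wiki/Template:Data_structures">v</a></li>
<p><a href="/wiki/Linked_list">Linked list</a></p>
'''
print(crawlable_links(sample))  # prints ['/wiki/Linked_list']
```

Because find_parents() walks all ancestors, this also works for links that are not directly inside an <li> tag.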

Another approach is to check whether the href attribute of the anchor tag begins with http or https, though I'm not entirely sure it fits the logic of your code. Anchor tags whose href begins with # link to sections within the same page. If you need to ignore those, you can select only the anchor tags whose href begins with http or https instead. This is what I mean:

import re
from bs4 import BeautifulSoup

html = '''
<li class="toclevel-1 tocsection-1"><a href="#Overview"><span class="tocnumber">1</span> <span class="toctext">Overview</span></a></li>
<li class="toclevel-1 tocsection-1"><a href="http://www.google.com"><span class="tocnumber">1</span> <span class="toctext">Overview</span></a></li>
<li class="toclevel-1 tocsection-1"><a href="#Overview"><span class="tocnumber">1</span> <span class="toctext">Overview</span></a></li>
'''
soup = BeautifulSoup(html, 'html.parser')
# Match only hrefs that start with http:// or https://
a_tag = soup.find('a', attrs={'href': re.compile(r'^https?://')})

This gives you only the link that begins with http.
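One caveat: the links in the question are relative (/wiki/...), so an http-only filter would skip them as well. A variation on the same href idea, sketched below, keeps /wiki/ links and drops the ones with a colon in the title, which is how the Template: and Template_talk: examples above look. ARTICLE_HREF and is_article_href are hypothetical names, and the colon rule is an assumption: it would also drop any article whose title itself contains a colon.

```python
import re

# Assumption: article links look like /wiki/Title (optionally followed by
# #section), while namespace pages such as Template: or Template_talk:
# have a colon in the title part.
ARTICLE_HREF = re.compile(r'^/wiki/[^:#]+(?:#.*)?$')

def is_article_href(href):
    """Return True for hrefs that look like ordinary article links."""
    return bool(href) and ARTICLE_HREF.match(href) is not None

for href in ('/wiki/Linked_list',
             '/wiki/Template:Data_structures',
             '/wiki/Template_talk:Data_structures',
             '#Overview'):
    print(href, is_article_href(href))
```

The compiled pattern can also be passed straight to Beautiful Soup as an href filter, e.g. soup.find_all('a', href=ARTICLE_HREF).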

shaktimaan
  • It sounds like that might work. What did you mean by using http vs https to solve the problem? (I'm kinda new to html parsing) – Kittenmittons Apr 13 '14 at 19:41
  • Thanks, I will try it shortly. – Kittenmittons Apr 13 '14 at 20:07
  • @Kittenmittons Sure. Please accept the answer later if it addresses your question. Thanks. – shaktimaan Apr 13 '14 at 20:08
  • Will do. OK, so this works for li tags with a class attribute if I do something of the form a_tag.find_parents(attrs={'class': re.compile('nv')}). How about if I want to look for multiple attributes, such as an id attribute as well (one example would be ...)? Is there a way to search for multiple attributes in the same statement (I'm having trouble with the syntax), or should I simply put them in separate statements and 'or' them together? – Kittenmittons Apr 13 '14 at 21:48
  • @Kittenmittons This has info about it - http://stackoverflow.com/questions/18725760/beautifulsoup-findall-given-multiple-classes – shaktimaan Apr 13 '14 at 22:00
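For the multiple-attribute question in the comments, one possible approach (a sketch, not something from the linked answer) is to pass a function to find_parents(). Beautiful Soup calls it with each ancestor tag, so the class and id checks can be combined with a plain or. The id value "hypothetical-nav" and the sample markup are invented for illustration, since the original example id was not preserved.

```python
from bs4 import BeautifulSoup

def is_unwanted_container(tag):
    """Match an ancestor by class OR by id in a single filter function."""
    classes = tag.get('class') or []
    has_nv_class = any(c.startswith('nv-') for c in classes)
    # 'hypothetical-nav' is a placeholder id, not taken from the thread.
    has_unwanted_id = tag.get('id') == 'hypothetical-nav'
    return has_nv_class or has_unwanted_id

html = '''
<li class="nv-talk"><a href="/wiki/Template_talk:Data_structures">t</a></li>
<div id="hypothetical-nav"><a href="/wiki/Special:Example">x</a></div>
<p><a href="/wiki/Linked_list">Linked list</a></p>
'''
soup = BeautifulSoup(html, 'html.parser')
# Keep only links with no ancestor matching either condition.
kept = [a['href'] for a in soup.find_all('a')
        if not a.find_parents(is_unwanted_container)]
print(kept)  # prints ['/wiki/Linked_list']
```

Two separate find_parents() calls joined with or would do the same job; the function form just keeps the rule in one place.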