I'm building a web crawler in Python using Beautiful Soup to crawl Wikipedia. The problem is that Wikipedia has a lot of garbage links that I don't want to follow.
For example:
Target links, with # before the target part:
<li class="toclevel-1 tocsection-1">
<a href="#Overview">
<span class="tocnumber">1</span>
<span class="toctext">Overview</span>
</a>
</li>
Talk pages:
<li class="nv-talk">
<a href="/wiki/Template_talk:Data_structures" title="Template talk:Data structures">
<span title="Discuss this template" style=";;background:none transparent;border:none;;">t</span>
</a>
</li>
Template pages:
<li class="nv-view">
<a href="/wiki/Template:Data_structures" title="Template:Data structures">
<span title="View this template" style=";;background:none transparent;border:none;;">v</span>
</a>
</li>
and so on...
Now, I'm storing all the links I've already visited in a dictionary so I don't visit them twice, so I can avoid the target links by simply checking whether the part of the link before the # symbol is already in the dictionary.
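For reference, here is a minimal sketch of that check using only the standard library. The names `visited`, `normalize`, and `should_visit` are made up for illustration, and it assumes each href is resolved against the URL of the page it was found on:

```python
from urllib.parse import urldefrag, urljoin

# Visited table, keyed by URL without the fragment (hypothetical name).
# A dict is used to match the description above; a set would work equally well.
visited = {}


def normalize(href, page_url):
    """Resolve href against the page it was found on and drop the #fragment."""
    absolute = urljoin(page_url, href)
    url, _fragment = urldefrag(absolute)
    return url


def should_visit(href, page_url):
    """Return True the first time a fragment-free URL is seen, False afterwards."""
    url = normalize(href, page_url)
    if url in visited:
        return False
    visited[url] = True
    return True
```

With this, a link like `#Overview` found on a page that is already in `visited` resolves to that same page's URL and is skipped, while `/wiki/Some_page` and `/wiki/Some_page#Section` count as the same entry.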
I'm having a little more trouble with talk, template, and other such pages, however.
Something unique about them is that they always appear within an <li> tag with some class attribute ("nv-talk", "nv-view", etc.). However, my crawler relies on looking at the <a> tags, so I don't have access to the attributes of the <li> tag that contains them.
Furthermore, not all links on a page are contained within an <li> tag, so I can't simply search for <li> tags instead.
Any ideas?