
I'm working on a graph project where I want an efficient way to grab all the links to other English wikipedia articles from a particular English wikipedia article.

Currently, I'm using bs4 and Python, but I don't know too much about bs4.

Here's what I have right now:

##### Imports #####
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
 

##### Functions #####
parser = 'html.parser'
resp = requests.get("https://en.wikipedia.org/wiki/Influenza")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type','').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
 
for link in soup.find_all('a', href=True):
    print(link['href'])

The problem with this is that I'm getting many unwanted links (non-English links and non-article links). I don't think I know enough about HTML to fix this, and I'd rather not filter every link returned by the find_all() call above, since that seems inefficient.

Any advice would be greatly appreciated. Thanks in advance!

  • https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup should shed some light on the issue – Paul Brennan Dec 15 '20 at 18:50
  • If you look closely, you can notice that all links on Wikipedia articles have `href` starting with `/wiki/...`. – Alexey S. Larionov Dec 15 '20 at 18:50
  • Non-English links are typically those to the left of a page - with links to translations, their `href` doesn't start with `/wiki/`, so no problem – Alexey S. Larionov Dec 15 '20 at 18:51
  • @AlexLarionov this seems useful, but I'm not sure where to go from here. I tried doing soup.findall(href=re.compile("/wiki/")) but I'm still getting the non-english pages – Caleb Bynum Dec 15 '20 at 20:10
  • @CalebBynum open some Wiki page and find an unwanted non-English link. Right-click it and choose something like "Inspect element", where you'll see what the `a` element looks like in the actual HTML. You may find some `class`/`id`/`href` attribute that distinguishes English from non-English – Alexey S. Larionov Dec 15 '20 at 20:53
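A minimal sketch of the filtering approach the comments describe, assuming (as on current English Wikipedia pages) that article links start with `/wiki/` and that non-article namespace pages such as `File:` or `Help:` contain a colon in the href. The function name `article_links` and the `sample` snippet are illustrative, not from the question:

```python
from bs4 import BeautifulSoup

def article_links(html):
    """Return hrefs that point to English Wikipedia articles.

    Article links start with /wiki/ and contain no ':' (a colon marks
    namespace pages such as File:, Help:, or Special:).
    """
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    # CSS attribute selector: only anchors whose href begins with /wiki/
    for a in soup.select('a[href^="/wiki/"]'):
        href = a["href"]
        if ":" not in href:
            links.add(href)
    return links

# Tiny sample illustrating the cases; real pages use the same shapes.
sample = """
<a href="/wiki/Virus">article</a>
<a href="/wiki/File:Flu.jpg">file page</a>
<a href="https://de.wikipedia.org/wiki/Influenza">interlanguage</a>
<a href="#cite_note-1">footnote</a>
"""
print(article_links(sample))  # {'/wiki/Virus'}
```

Interlanguage links use absolute URLs on other domains, so the `/wiki/` prefix test already excludes them; the colon test drops file, help, and other namespace pages.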

1 Answer


Did you try using the Wikipedia API to get all the links? That is the most accurate way to get such results.

In your case, you can use this API call to get all the links inside the Influenza page (`prop=links` lists the links on a page; `prop=linkshere` would instead list the pages that link *to* it):

https://en.wikipedia.org/w/api.php?action=query&format=json&prop=links&titles=Influenza&pllimit=500

Just change Influenza in the link above to any other Wikipedia article title and it will work the same way.
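A sketch of calling that endpoint from Python with `requests`, following the API's `continue` mechanism to page through results. The function name `page_links` is illustrative; `plnamespace=0` restricts results to main-namespace (article) pages:

```python
import itertools
import requests

API = "https://en.wikipedia.org/w/api.php"

def page_links(title):
    """Yield titles of articles linked from `title`, following continuation."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "links",
        "titles": title,
        "plnamespace": 0,   # main/article namespace only
        "pllimit": "max",
    }
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["pages"].values():
            for link in page.get("links", []):
                yield link["title"]
        if "continue" not in data:
            break
        # Merge the continuation token into the next request
        params.update(data["continue"])

# Print the first few linked article titles
for t in itertools.islice(page_links("Influenza"), 10):
    print(t)
```

This returns clean article titles rather than raw hrefs, so there is nothing to filter afterwards.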

ASammour