I'm working on a graph project and want an efficient way to grab all the links to other English Wikipedia articles from a given English Wikipedia article.
Currently I'm using Python with bs4 (BeautifulSoup), but I don't know much about bs4.
Here's what I have right now:
##### Imports #####
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

##### Script #####
parser = 'html.parser'
resp = requests.get("https://en.wikipedia.org/wiki/Influenza")

# Prefer the encoding declared in the HTML itself, falling back to the
# charset from the HTTP Content-Type header (if it names one)
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

# Print the href of every anchor tag that has one
for link in soup.find_all('a', href=True):
    print(link['href'])
The problem is that this returns many unwanted links (non-English links and non-article links). I don't know enough about HTML to fix this myself, and I'd rather not filter through every link returned by the find_all() call above, since that seems inefficient.
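The best I've come up with so far is pushing a compiled regex into find_all() itself so bs4 does the matching during the parse, but it still has to test every anchor on the page, which is what I was hoping to avoid. This is just a sketch, and it assumes (from what I've seen in the page source) that article links are relative hrefs starting with /wiki/ and that non-article namespace pages (File:, Category:, Help:, and so on) always contain a colon:

import re

# Assumption: English article links look like /wiki/Some_Article, while
# namespace pages such as /wiki/File:Something.jpg include a colon
article_href = re.compile(r'^/wiki/[^:]*$')

for link in soup.find_all('a', href=article_href):
    print(link['href'])

I've also noticed that the article text seems to live inside a div with id="bodyContent", so maybe limiting the search to that element would at least skip the sidebar and navigation links, but I don't know if that's the right way to go about it.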
Any advice would be greatly appreciated. Thanks in advance!