I want to find all the hyperlinks in a Spanish Wikipedia page, save them in a list, and then repeat the process recursively: take every link from a first Spanish Wikipedia article, follow each of them to another page, and so on, saving all the links along the way.
The idea is to have an endless link-gathering tool that I can stop whenever I decide I have collected enough links.
So far I have the first step: a starting Spanish Wikipedia article whose page I crawl for its hyperlinks. What I don't know is how to make this recursive, that is, how to visit each hyperlink and repeat the same extraction again and again.
Here is my code:
url = "https://es.wikipedia.org/wiki/Olula_del_Río" # URL of the trigger article
webpage = requests.get(url)
html_content = webpage.content
# Parse the webpage content
soup = BeautifulSoup(html_content, "lxml") # Another parser is 'html.parser'
#print(soup.prettify())
# Extract only the tags containing the hyperlinks
urls_list = []
for url in soup.find_all('a', href=True):
url = url.get('href')
url = unquote(url) # URL encoding
urls_list.append(url)
#print(url)
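One thing I have noticed is that the hrefs I collect are mostly relative paths like /wiki/Almería, plus anchors and other non-article links, so I guess I would need to turn them into absolute URLs and filter them before I can request them. Something like this is what I have in mind (urljoin and the filtering condition are just my own guesses at what counts as an article link):

from urllib.parse import urljoin

article_links = []
for href in urls_list:
    # Keep only internal article links, skipping namespaces like Especial: or Archivo:
    if href.startswith('/wiki/') and ':' not in href:
        article_links.append(urljoin("https://es.wikipedia.org", href))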
Now I would like to visit each hyperlink in urls_list, repeat the same process on the corresponding page, and append the new links to the list.
Is there a manageable way to do this?
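For reference, this is the sort of loop I imagine, using a queue instead of actual recursion so I can interrupt it at any point, but I am not sure it is the right approach (the use of collections.deque, the visited set and the max_links cutoff are just my own guesses):

from collections import deque
from urllib.parse import unquote, urljoin

import requests
from bs4 import BeautifulSoup

def get_article_links(page_url):
    """Return the absolute article URLs found on one Spanish Wikipedia page."""
    soup = BeautifulSoup(requests.get(page_url).content, "lxml")
    links = []
    for a in soup.find_all('a', href=True):
        href = unquote(a['href'])
        if href.startswith('/wiki/') and ':' not in href:
            links.append(urljoin("https://es.wikipedia.org", href))
    return links

start_url = "https://es.wikipedia.org/wiki/Olula_del_Río"
queue = deque([start_url])   # pages still to visit
visited = set()              # pages already crawled
all_links = []               # every link gathered so far

max_links = 10000            # arbitrary cutoff so the loop eventually stops
while queue and len(all_links) < max_links:
    page = queue.popleft()
    if page in visited:
        continue
    visited.add(page)
    for link in get_article_links(page):
        all_links.append(link)
        if link not in visited:
            queue.append(link)

Would this be a reasonable way to do it, or is there a more standard pattern for this kind of open-ended crawl?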