WoW, it takes about 30 min to find a solution,
I found a simple and efficient way to do this,
As @αԋɱҽԃ-αмєяιcαη mentioned, some time if your website linked to a BIG website like google, etc, it wont be stop until you memory get full of data.
so there are steps that you should consider.
- make a while loop to seek thorough your website to extract all of urls
- use Exceptions handling to prevent crashes
- remove duplicates and separate the urls
- set a limitation to number of urls, like when 1000 urls found
- stop while loop to prevent your PC's memory getting full
here a sample code and it should works fine, I actually tested it and it was fun fore me:
import requests
from bs4 import BeautifulSoup
import re
import time
source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')
data = []
links = []
def remove_duplicates(l): # remove duplicates and unURL string
for item in l:
match = re.search("(?P<url>https?://[^\s]+)", item)
if match is not None:
links.append((match.group("url")))
for link in soup.find_all('a', href=True):
data.append(str(link.get('href')))
flag = True
remove_duplicates(data)
while flag:
try:
for link in links:
for j in soup.find_all('a', href=True):
temp = []
source_code = requests.get(link)
soup = BeautifulSoup(source_code.content, 'lxml')
temp.append(str(j.get('href')))
remove_duplicates(temp)
if len(links) > 162: # set limitation to number of URLs
break
if len(links) > 162:
break
if len(links) > 162:
break
except Exception as e:
print(e)
if len(links) > 162:
break
for url in links:
print(url)
and the output will be:
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackoverflow.com/users/signup?ssrc=head&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com/users/login?ssrc=site_switcher&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackexchange.com/sites
https://stackoverflow.blog
https://stackoverflow.com/legal/cookie-policy
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/legal/terms-of-service/public
https://stackoverflow.com/teams
https://stackoverflow.com/teams
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://www.g2.com/products/stack-overflow-for-teams/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.fastcompany.com/most-innovative-companies/2019/sectors/enterprise
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/questions/55884514/what-is-the-incentive-for-curl-to-release-the-library-for-free/55885729#55885729
https://insights.stackoverflow.com/
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/jobs
https://stackoverflow.com/jobs/directory/developer-jobs
https://stackoverflow.com/jobs/salary
https://www.stackoverflowbusiness.com
https://stackoverflow.com/teams
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/enterprise
https://stackoverflow.com/company/about
https://stackoverflow.com/company/about
https://stackoverflow.com/company/press
https://stackoverflow.com/company/work-here
https://stackoverflow.com/legal
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/company/contact
https://stackexchange.com
https://stackoverflow.com
https://serverfault.com
https://superuser.com
https://webapps.stackexchange.com
https://askubuntu.com
https://webmasters.stackexchange.com
https://gamedev.stackexchange.com
https://tex.stackexchange.com
https://softwareengineering.stackexchange.com
https://unix.stackexchange.com
https://apple.stackexchange.com
https://wordpress.stackexchange.com
https://gis.stackexchange.com
https://electronics.stackexchange.com
https://android.stackexchange.com
https://security.stackexchange.com
https://dba.stackexchange.com
https://drupal.stackexchange.com
https://sharepoint.stackexchange.com
https://ux.stackexchange.com
https://mathematica.stackexchange.com
https://salesforce.stackexchange.com
https://expressionengine.stackexchange.com
https://pt.stackoverflow.com
https://blender.stackexchange.com
https://networkengineering.stackexchange.com
https://crypto.stackexchange.com
https://codereview.stackexchange.com
https://magento.stackexchange.com
https://softwarerecs.stackexchange.com
https://dsp.stackexchange.com
https://emacs.stackexchange.com
https://raspberrypi.stackexchange.com
https://ru.stackoverflow.com
https://codegolf.stackexchange.com
https://es.stackoverflow.com
https://ethereum.stackexchange.com
https://datascience.stackexchange.com
https://arduino.stackexchange.com
https://bitcoin.stackexchange.com
https://sqa.stackexchange.com
https://sound.stackexchange.com
https://windowsphone.stackexchange.com
https://stackexchange.com/sites#technology
https://photo.stackexchange.com
https://scifi.stackexchange.com
https://graphicdesign.stackexchange.com
https://movies.stackexchange.com
https://music.stackexchange.com
https://worldbuilding.stackexchange.com
https://video.stackexchange.com
https://cooking.stackexchange.com
https://diy.stackexchange.com
https://money.stackexchange.com
https://academia.stackexchange.com
https://law.stackexchange.com
https://fitness.stackexchange.com
https://gardening.stackexchange.com
https://parenting.stackexchange.com
https://stackexchange.com/sites#lifearts
https://english.stackexchange.com
https://skeptics.stackexchange.com
https://judaism.stackexchange.com
https://travel.stackexchange.com
https://christianity.stackexchange.com
https://ell.stackexchange.com
https://japanese.stackexchange.com
https://chinese.stackexchange.com
https://french.stackexchange.com
https://german.stackexchange.com
https://hermeneutics.stackexchange.com
https://history.stackexchange.com
https://spanish.stackexchange.com
https://islam.stackexchange.com
https://rus.stackexchange.com
https://russian.stackexchange.com
https://gaming.stackexchange.com
https://bicycles.stackexchange.com
https://rpg.stackexchange.com
https://anime.stackexchange.com
https://puzzling.stackexchange.com
https://mechanics.stackexchange.com
https://boardgames.stackexchange.com
https://bricks.stackexchange.com
https://homebrew.stackexchange.com
https://martialarts.stackexchange.com
https://outdoors.stackexchange.com
https://poker.stackexchange.com
https://chess.stackexchange.com
https://sports.stackexchange.com
https://stackexchange.com/sites#culturerecreation
https://mathoverflow.net
https://math.stackexchange.com
https://stats.stackexchange.com
https://cstheory.stackexchange.com
https://physics.stackexchange.com
https://chemistry.stackexchange.com
https://biology.stackexchange.com
https://cs.stackexchange.com
https://philosophy.stackexchange.com
https://linguistics.stackexchange.com
https://psychology.stackexchange.com
https://scicomp.stackexchange.com
https://stackexchange.com/sites#science
https://meta.stackexchange.com
https://stackapps.com
https://api.stackexchange.com
https://data.stackexchange.com
https://stackoverflow.blog?blb=1
https://www.facebook.com/officialstackoverflow/
https://twitter.com/stackoverflow
https://linkedin.com/company/stack-overflow
https://creativecommons.org/licenses/by-sa/4.0/
https://stackoverflow.blog/2009/06/25/attribution-required/
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
Process finished with exit code 0
I set the limitation to 162, you can increase it as many as you want and you ram allowed.