
I'm working on a project that requires extracting all links from a website. With the following code I can get all of the links from a single URL:

import requests
from bs4 import BeautifulSoup, SoupStrainer

source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')
links = []

for link in soup.find_all('a'):
    links.append(str(link))  # stores the whole <a> tag as a string

The problem is that if I want to extract all URLs, I have to write another for loop, and then another one, and so on. I want to extract every URL that exists on this website and on its subdomains. Is there any way to do this without writing nested for loops? And even with nested for loops, I don't know how many I would need to reach all the URLs.

  • Does this answer your question? [retrieve links from web page using python and BeautifulSoup](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) – l'L'l Dec 15 '19 at 19:30
  • No, it's not. Also, the answers to that question no longer work, because BeautifulSoup has changed since then. –  Dec 15 '19 at 19:33
  • @Mona well then you will need to use the stackoverflow `API`. – αԋɱҽԃ αмєяιcαη Dec 15 '19 at 19:50
  • it's the second time that you guys deleted your answers :((( –  Dec 15 '19 at 19:53
  • I need an algorithm that works on every website. –  Dec 15 '19 at 19:53
  • Do you just want URLs that start with https://www.stackoverflow.com/? – oppressionslayer Dec 15 '19 at 19:54
  • @Mona actually we just read your question as scraping a webpage, not getting the full set of URLs. Anyway, what you are looking for is a full crawler of the website, so you need an infinite loop which will never break until your memory goes boom: it will keep getting each URL, then open each URL and collect its links, and so on. – αԋɱҽԃ αмєяιcαη Dec 15 '19 at 19:55
  • Can you clarify what you’re trying to do? After reading @αԋɱҽԃαмєяιcαη comment I’m no longer certain that I understand. – AMC Dec 15 '19 at 19:57
  • @AlexanderCécile she's looking to get every single URL inside the website. `I want to extract every URL that exists on this website and on its subdomains.` – αԋɱҽԃ αмєяιcαη Dec 15 '19 at 19:58
  • @oppressionslayer I need all URLs inside of my URL and all of my sub URLs too not only the sub URL. –  Dec 15 '19 at 19:59
  • @αԋɱҽԃ-αмєяιcαη my target website has a limited number of URLs inside it –  Dec 15 '19 at 20:00
  • @Mona What do you mean by _all of my sub URLs too not only the sub URL_? In any case, this sounds like web crawling, not web scraping. – AMC Dec 15 '19 at 20:01
  • We might still be able to help, if you can provide more details. You want only the URLs on this single domain? – AMC Dec 15 '19 at 20:03
  • @alexander-cécile yeah, you're right, it's web crawling not web scraping. Now how can I do that? –  Dec 15 '19 at 20:03
  • Related/possible duplicate: https://stackoverflow.com/q/1080411/11301900 – AMC Dec 15 '19 at 21:51

3 Answers

5

Wow, it took about 30 minutes to find a solution. I found a simple and efficient way to do this. As @αԋɱҽԃ-αмєяιcαη mentioned, sometimes, if your website links to a BIG website like Google, etc., the crawl won't stop until your memory is full of data. So here are the steps you should consider:

  1. make a while loop that keeps seeking through your website to extract all of the URLs
  2. use exception handling to prevent crashes
  3. remove duplicates and keep only the URL strings (see the sketch right after this list)
  4. set a limit on the number of URLs, e.g. stop when 1000 URLs are found
  5. stop the while loop to prevent your PC's memory from filling up
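
For step 3, a quick sketch of how a `set` could keep the collected URLs unique (just an illustration with a hypothetical helper name; the tested code below keeps a plain list and filters with a regex instead):

import re

seen = set()  # a set stores each URL only once, so duplicates disappear automatically

def collect_urls(items):  # hypothetical helper, not part of the code below
    for item in items:
        match = re.search(r"https?://\S+", item)  # keep only strings that look like absolute URLs
        if match is not None:
            seen.add(match.group(0))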

Here is a sample code, and it should work fine. I actually tested it, and it was fun for me:

import requests
from bs4 import BeautifulSoup
import re

source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')
data = []
links = []


def remove_duplicates(l):  # keeps only the strings that look like URLs (it does not actually deduplicate)
    for item in l:
        match = re.search(r"(?P<url>https?://[^\s]+)", item)
        if match is not None:
            links.append(match.group("url"))


for link in soup.find_all('a', href=True):
    data.append(str(link.get('href')))
flag = True
remove_duplicates(data)
while flag:
    try:
        for link in links:
            for j in soup.find_all('a', href=True):
                temp = []
                source_code = requests.get(link)
                soup = BeautifulSoup(source_code.content, 'lxml')
                temp.append(str(j.get('href')))
                remove_duplicates(temp)

                if len(links) > 162:  # limit on the number of URLs
                    break
            if len(links) > 162:
                break
        if len(links) > 162:
            break
    except Exception as e:
        print(e)
        if len(links) > 162:
            break

for url in links:
    print(url)

and the output will be:

https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackoverflow.com/users/signup?ssrc=head&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com/users/login?ssrc=site_switcher&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackexchange.com/sites
https://stackoverflow.blog
https://stackoverflow.com/legal/cookie-policy
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/legal/terms-of-service/public
https://stackoverflow.com/teams
https://stackoverflow.com/teams
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://www.g2.com/products/stack-overflow-for-teams/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.fastcompany.com/most-innovative-companies/2019/sectors/enterprise
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/questions/55884514/what-is-the-incentive-for-curl-to-release-the-library-for-free/55885729#55885729
https://insights.stackoverflow.com/
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/jobs
https://stackoverflow.com/jobs/directory/developer-jobs
https://stackoverflow.com/jobs/salary
https://www.stackoverflowbusiness.com
https://stackoverflow.com/teams
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/enterprise
https://stackoverflow.com/company/about
https://stackoverflow.com/company/about
https://stackoverflow.com/company/press
https://stackoverflow.com/company/work-here
https://stackoverflow.com/legal
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/company/contact
https://stackexchange.com
https://stackoverflow.com
https://serverfault.com
https://superuser.com
https://webapps.stackexchange.com
https://askubuntu.com
https://webmasters.stackexchange.com
https://gamedev.stackexchange.com
https://tex.stackexchange.com
https://softwareengineering.stackexchange.com
https://unix.stackexchange.com
https://apple.stackexchange.com
https://wordpress.stackexchange.com
https://gis.stackexchange.com
https://electronics.stackexchange.com
https://android.stackexchange.com
https://security.stackexchange.com
https://dba.stackexchange.com
https://drupal.stackexchange.com
https://sharepoint.stackexchange.com
https://ux.stackexchange.com
https://mathematica.stackexchange.com
https://salesforce.stackexchange.com
https://expressionengine.stackexchange.com
https://pt.stackoverflow.com
https://blender.stackexchange.com
https://networkengineering.stackexchange.com
https://crypto.stackexchange.com
https://codereview.stackexchange.com
https://magento.stackexchange.com
https://softwarerecs.stackexchange.com
https://dsp.stackexchange.com
https://emacs.stackexchange.com
https://raspberrypi.stackexchange.com
https://ru.stackoverflow.com
https://codegolf.stackexchange.com
https://es.stackoverflow.com
https://ethereum.stackexchange.com
https://datascience.stackexchange.com
https://arduino.stackexchange.com
https://bitcoin.stackexchange.com
https://sqa.stackexchange.com
https://sound.stackexchange.com
https://windowsphone.stackexchange.com
https://stackexchange.com/sites#technology
https://photo.stackexchange.com
https://scifi.stackexchange.com
https://graphicdesign.stackexchange.com
https://movies.stackexchange.com
https://music.stackexchange.com
https://worldbuilding.stackexchange.com
https://video.stackexchange.com
https://cooking.stackexchange.com
https://diy.stackexchange.com
https://money.stackexchange.com
https://academia.stackexchange.com
https://law.stackexchange.com
https://fitness.stackexchange.com
https://gardening.stackexchange.com
https://parenting.stackexchange.com
https://stackexchange.com/sites#lifearts
https://english.stackexchange.com
https://skeptics.stackexchange.com
https://judaism.stackexchange.com
https://travel.stackexchange.com
https://christianity.stackexchange.com
https://ell.stackexchange.com
https://japanese.stackexchange.com
https://chinese.stackexchange.com
https://french.stackexchange.com
https://german.stackexchange.com
https://hermeneutics.stackexchange.com
https://history.stackexchange.com
https://spanish.stackexchange.com
https://islam.stackexchange.com
https://rus.stackexchange.com
https://russian.stackexchange.com
https://gaming.stackexchange.com
https://bicycles.stackexchange.com
https://rpg.stackexchange.com
https://anime.stackexchange.com
https://puzzling.stackexchange.com
https://mechanics.stackexchange.com
https://boardgames.stackexchange.com
https://bricks.stackexchange.com
https://homebrew.stackexchange.com
https://martialarts.stackexchange.com
https://outdoors.stackexchange.com
https://poker.stackexchange.com
https://chess.stackexchange.com
https://sports.stackexchange.com
https://stackexchange.com/sites#culturerecreation
https://mathoverflow.net
https://math.stackexchange.com
https://stats.stackexchange.com
https://cstheory.stackexchange.com
https://physics.stackexchange.com
https://chemistry.stackexchange.com
https://biology.stackexchange.com
https://cs.stackexchange.com
https://philosophy.stackexchange.com
https://linguistics.stackexchange.com
https://psychology.stackexchange.com
https://scicomp.stackexchange.com
https://stackexchange.com/sites#science
https://meta.stackexchange.com
https://stackapps.com
https://api.stackexchange.com
https://data.stackexchange.com
https://stackoverflow.blog?blb=1
https://www.facebook.com/officialstackoverflow/
https://twitter.com/stackoverflow
https://linkedin.com/company/stack-overflow
https://creativecommons.org/licenses/by-sa/4.0/
https://stackoverflow.blog/2009/06/25/attribution-required/
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising

Process finished with exit code 0

I set the limit to 162; you can increase it as much as you want and as much as your RAM allows.

Ali Akhtari
  • thank you so much, you saved my day :) long and a little dirty code, but it works fine as you said. –  Dec 15 '19 at 21:07
  • Wait, what's up with `remove_duplicates()`?! Why not just extract the URL and put it in a set? – AMC Dec 15 '19 at 21:57
  • @alexander-cécile yeah, my code is nasty; I'm a little bit busy, so I will edit it tomorrow. About checking `if len(links) > 162`: I did this to check the condition after every step, and I know it's not necessary. – Ali Akhtari Dec 15 '19 at 21:57
  • @alexander-cécile if you have extra time, I would be glad if you edited my answer; otherwise I will edit it later. – Ali Akhtari Dec 15 '19 at 21:58
  • @Ali I think I'll make a separate answer, because I plan on using a different library entirely. – AMC Dec 15 '19 at 21:59
  • @alexander-cécile glad to hear it, can't wait to learn a new way to do this. – Ali Akhtari Dec 15 '19 at 22:00
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/204312/discussion-between-ali-and-alexander-cecile). – Ali Akhtari Dec 15 '19 at 22:01
0

How's this?

import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

source_code = requests.get('https://stackoverflow.com/')
doc = SimplifiedDoc(source_code.content.decode('utf-8'))  # pass the downloaded HTML in as a string
lst = doc.listA(url='https://stackoverflow.com/')  # get all links on the page
for a in lst:
  if a['url'].find('stackoverflow.com') > 0:  # keep the site and its subdomains
    print(a['url'])

You can also use this crawler framework, which can help you do many things:

from simplified_scrapy.spider import Spider, SimplifiedDoc
class DemoSpider(Spider):
  name = 'demo-spider'
  start_urls = ['http://www.example.com/']
  allowed_domains = ['example.com/']
  def extract(self, url, html, models, modelNames):
    doc = SimplifiedDoc(html)
    lstA = doc.listA(url=url["url"])
    return [{"Urls": lstA, "Data": None}]

from simplified_scrapy.simplified_main import SimplifiedMain
SimplifiedMain.startThread(DemoSpider())
dabingsou
-1

Well, actually what you are asking for is possible, but that means an infinite loop which will keep running and running until your memory goes BoOoOoOm.

Anyway, the idea should be like the following.

  • you will use `for item in soup.findAll('a')` and then `item.get('href')`

  • add the results to a set to get rid of duplicate URLs, and use an `is not None` check to get rid of `None` objects.

  • then keep looping over and over until your set is empty, checking something like `len(urls)`; a minimal sketch of this loop is shown right after this list.
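
A minimal sketch of that loop (just an illustration: the start URL, the `max_pages` cap, and the bare `startswith('http')` check are placeholder choices, and a real crawl would also need the timeouts, threading, and relative-URL handling mentioned in the comments below):

import requests
from bs4 import BeautifulSoup

start = 'https://stackoverflow.com/'  # placeholder start URL
to_visit = {start}                    # URLs waiting to be fetched
urls = set()                          # URLs already collected; a set removes duplicates
max_pages = 50                        # safety cap so the loop cannot run until memory is full

while to_visit and len(urls) < max_pages:
    url = to_visit.pop()
    urls.add(url)
    try:
        page = requests.get(url, timeout=10)
    except requests.RequestException:
        continue                      # skip pages that fail to load
    soup = BeautifulSoup(page.content, 'lxml')
    for item in soup.findAll('a'):
        href = item.get('href')
        if href is not None and href.startswith('http') and href not in urls:
            to_visit.add(href)

print(len(urls), 'URLs collected')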

  • It's not the answer to my question. For instance, there are over 10,000,000 questions on stackoverflow; I need code that extracts all the existing URLs, including the URLs of all of stackoverflow's posts, etc. –  Dec 15 '19 at 19:50
  • can you give me some code showing how to implement this idea? –  Dec 15 '19 at 20:09
  • @Mona that will need you to implement it yourself, because you will need to use `try/except` and `timeout` and `threading`.. many things! What if the `href` holds a path such as `/file1/file2/`? Then you will need an `f` string, like `f"www.site.com/{url}"`, and too many other things which the developer him/herself needs to focus on. – αԋɱҽԃ αмєяιcαη Dec 15 '19 at 20:14
  • @Mona What are you using all this for? – AMC Dec 15 '19 at 20:26
  • I agree with @αԋɱҽԃαмєяιcαη though, this is a rather large question. – AMC Dec 15 '19 at 20:27
  • @αԋɱҽԃ-αмєяιcαη I'm a new member of stackoverflow. Somebody answered my question very well; my question wasn't opinion-based, it was just a question that you weren't able to help with. –  Dec 15 '19 at 21:12
  • @Mona if you had taken the time to read what I've typed, you would not see a difference. But anyway, feel free to think as you want; we're used to dealing with such comments. – αԋɱҽԃ αмєяιcαη Dec 15 '19 at 21:47
  • @αԋɱҽԃαмєяιcαη I googled RAM usage for this question: with 12 GB of RAM you are able to store about 128849018 URLs (100 characters per URL) in your RAM as a variable, so I think it won't be a problem. – Ali Akhtari Dec 15 '19 at 22:50
  • @Ali you need to understand the difference between RAM and Storage. – αԋɱҽԃ αмєяιcαη Dec 15 '19 at 22:57
  • @αԋɱҽԃαмєяιcαη I know the difference – Ali Akhtari Dec 15 '19 at 22:58
  • @αԋɱҽԃαмєяιcαη run the code and look at your RAM usage; with about 10000 links scraped, my RAM barely moved. – Ali Akhtari Dec 15 '19 at 22:58
  • @Ali well, then you will be running your program forever and waiting for it to finish, since you are talking about a for loop running without threads, which uses only one variable, like a rat that keeps running and stopping. With threads the scenario would be different. – αԋɱҽԃ αмєяιcαη Dec 15 '19 at 23:10