
I don't have any code for this problem yet, but I'll try my best to explain everything.

For example, say you are scraping a website that contains three different links, and you want to scrape what is inside each and every one of them without having to do it manually. Is this possible with just BeautifulSoup and the Requests library, or would you have to use another library, e.g. Scrapy?

If you want, you can try it on this website: https://www.bleepingcomputer.com/. What I am trying to achieve is to scrape the website and what is inside its links at the same time.

If it's not possible to do it with only Requests & BeautifulSoup, feel free to use Scrapy as well.

  • Assuming the links are hyperlinks (a tags) and not buttons, you can access where they are redirecting to through the href property – hopperelec Nov 13 '22 at 10:58
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. – Community Nov 13 '22 at 10:59
  • Does this answer your question? [retrieve links from web page using python and BeautifulSoup](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) – hopperelec Nov 13 '22 at 11:15
  • @hopperelec no, because I'm trying to scrape the content inside the links, not the links themselves. – ninj Nov 13 '22 at 11:54
  • @ninj Then why don't you just scrape it the same way you would with any other HTML element? Links are just HTML elements and the content of the link is the content of the HTML element – hopperelec Nov 13 '22 at 11:55
  • @hopperelec the aim is to scrape what is INSIDE the link, i.e. a new page. – ninj Nov 13 '22 at 12:01
  • 1
    That's not 'inside the link'. That is a completely different page the link happened to be linking to. You would need to make a new request to the linked page. You should edit your question to make it clearer, because as currently worded it could mean two completely different things. – hopperelec Nov 13 '22 at 12:08
  • @hopperelec but would it be possible to automatically just go to the link and scrape everything in there? – ninj Nov 13 '22 at 12:11
  • You would need to make a new request to the linked page – hopperelec Nov 13 '22 at 12:18
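The approach hopperelec describes in the comments, extracting each link's href and then making a new request to that URL, can be sketched with Requests and BeautifulSoup. This is a minimal sketch with hypothetical helper names, and note that some sites (bleepingcomputer.com included, per the answer below) may block plain `requests` traffic:

```python
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(html, base_url):
    """Return absolute URLs for every <a href> in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

def scrape_page_and_links(start_url):
    """Fetch start_url, then fetch every page its links point to."""
    html = requests.get(start_url, timeout=10).text
    pages = {}
    for link in extract_links(html, start_url):
        try:
            pages[link] = requests.get(link, timeout=10).text
        except requests.RequestException:
            pass  # skip links that fail to load
    return pages
```

`urljoin` matters here because many sites use relative hrefs like `/news/...`, which can't be requested directly.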

3 Answers


You can scrape the links via the a tag. The HTML will contain the hyperlinks, and the actual website each one links to is in its href attribute. Ex:

<a href="https://google.com">Site 1</a>

The href value is the destination link, and "Site 1" is just the text shown on the page.
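In code, pulling those href attributes out with BeautifulSoup looks like this (a minimal sketch with made-up HTML):

```python
from bs4 import BeautifulSoup

# hypothetical snippet of page HTML
html = '<a href="https://google.com">Site 1</a> <a href="https://example.com">Site 2</a>'

soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a", href=True):
    print(a["href"], "->", a.get_text(strip=True))
    # prints e.g.: https://google.com -> Site 1
```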


scrape the website, and what is inside the links at the same time

Assuming that you mean you want to scrape all the pages that the links on the site lead to, I have a recursive crawler which should be able to do it. The function is actually meant to go much further, but calling linkTreeMapper('https://www.bleepingcomputer.com/', pageLimit=2) returned output like this:

{
    "url": "https://www.bleepingcomputer.com/",
    "atDepth": 0,
    "pageTitle": "BleepingComputer | Cybersecurity, Technology News and Support",
    "pageBodyText": "News Featured Latest US govt: Iranian hackers br...n in with Twitter Not a member yet? Register Now",
    "pageUrls": [
        {
            "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
            "atDepth": 1,
            "pageTitle": "US govt: Iranian hackers breached federal agency using Log4Shell exploit",
            "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT",
            "pageUrls": [
                {
                    "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
                    "atDepth": 2,
                    "pageTitle": "US govt: Iranian hackers breached federal agency using Log4Shell exploit",
                    "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT",
                    "pageUrls": [
                        {
                            "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
                            "atDepth": 3,
                            "pageTitle": "US govt: Iranian hackers breached federal agency using Log4Shell exploit",
                            "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT"
                        },
                        {
                            "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
                            "atDepth": 3,
                            "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
                            "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT"
                        }
                    ]
                },
                {
                    "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
                    "atDepth": 2,
                    "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
                    "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT",
                    "pageUrls": [
                        {
                            "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
                            "atDepth": 3,
                            "errorMessage": "Message: timeout: Timed out receiving message from renderer: -0.009\n  (Session info: chrome=107.0.5304.88)\n"
                        },
                        {
                            "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
                            "atDepth": 3,
                            "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
                            "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT"
                        }
                    ]
                }
            ]
        },
        {
            "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
            "atDepth": 1,
            "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
            "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT",
            "pageUrls": [
                {
                    "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
                    "atDepth": 2,
                    "pageTitle": "US govt: Iranian hackers breached federal agency using Log4Shell exploit",
                    "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT",
                    "pageUrls": [
                        {
                            "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
                            "atDepth": 3,
                            "pageTitle": "US govt: Iranian hackers breached federal agency using Log4Shell exploit",
                            "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT"
                        },
                        {
                            "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
                            "atDepth": 3,
                            "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
                            "pageBodyText": "News Featured Latest US govt: Iranian hackers br...SUBMIT advertisement advertisement advertisement"
                        }
                    ]
                },
                {
                    "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
                    "atDepth": 2,
                    "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
                    "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT",
                    "pageUrls": [
                        {
                            "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
                            "atDepth": 3,
                            "pageTitle": "US govt: Iranian hackers breached federal agency using Log4Shell exploit",
                            "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT"
                        },
                        {
                            "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
                            "atDepth": 3,
                            "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
                            "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT"
                        }
                    ]
                }
            ]
        }
    ]
}

(I set pageLimit=2 so that the output would be small enough to view here)
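The linkTreeMapper code itself isn't reproduced here, but a depth-limited recursive crawler that produces this kind of nested structure can be sketched roughly as follows. All names are hypothetical, the fetch function is injected so the sketch can be exercised without network access, and a real crawl would also want delays and deduplication:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def link_tree(url, fetch, depth=0, max_depth=2, links_per_page=2):
    """Recursively build a nested dict of pages reachable from url."""
    node = {"url": url, "atDepth": depth}
    try:
        html = fetch(url)  # fetch(url) -> html string; may raise on failure
    except Exception as e:
        node["errorMessage"] = str(e)  # record failures like the timeout above
        return node
    soup = BeautifulSoup(html, "html.parser")
    node["pageTitle"] = soup.title.string if soup.title else None
    if depth < max_depth:  # only recurse while under the depth limit
        links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
        node["pageUrls"] = [
            link_tree(u, fetch, depth + 1, max_depth, links_per_page)
            for u in links[:links_per_page]
        ]
    return node
```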


However, recursion can be dangerous in cases like these, and as you can see, it's harder to eliminate repeated crawling of the same page; so, it might be better to use this queue-based crawler:

## FIRST COPY [OR DOWNLOAD&IMPORT] REQUIREMENTS FROM https://pastebin.com/TBtYja5D ##
setGlobals(varDict={
    'starterUrl': 'https://www.bleepingcomputer.com/', 
    'pageLimit': None, 'maxScrapes': 55, 'scrapeCt': 0, 'curUrlId': 0
}, clearLog=True)
nextUrl = get_next_fromScrawlQ()
while nextUrl: nextUrl = logScrape(scrapeUrl(nextUrl))
saveScrawlSess('qScrawl_bleepingComp.csv', 'vScrawl_bleepC.json')

and "qScrawl_bleepingComp.csv" would look like

url refUrlId status urlId pageTitle pageText pageUrlCt newToQueue
https://www.bleepingcomputer.com/ [starter] scraped 1 BleepingComputer | Cybersecurity, Technology News and Support News Featured Latest Exploit rele...er Not a member yet? Register Now 237.0 129.0
https://www.facebook.com/BleepingComputer 1 ?scraped 2 BleepingComputer | New York NY NaN 0.0 0.0
https://twitter.com/BleepinComputer 1 scraped 3 NaN JavaScript is not available. We’v...ret — let’s give it another shot. 6.0 6.0
https://www.youtube.com/user/BleepingComputer 1 ?scraped 4 YouTube BleepingComputer - YouTube 0.0 0.0
https://www.bleepingcomputer.com/news/security/exploit-released-for-actively-abused-proxynotshell-exchange-bug/ 1 scraped 5 Exploit released for actively abused ProxyNotShell Exchange bug News Featured Latest Exploit rele... prohibited. Submitting... SUBMIT 148.0 26.0

(Only the first five rows [of 1.6k+] are above, and the last 2 columns have not been included - see uploaded csv for full output.)

The "pageUrlIds" column has a list of numbers, each of which should correspond to a row according to the "urlId" column; so, if you wanted to, you could use that to form a nested dictionary like the output of the recursive crawler.
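The pastebin helpers aren't reproduced here, but the core idea of the queue-based approach, a breadth-first crawl with a "seen" set so each URL is scraped at most once, can be sketched like this (hypothetical names; the fetch function is passed in so the sketch runs without network access):

```python
from collections import deque
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl(start_url, fetch, max_scrapes=55):
    """Breadth-first crawl; each URL is queued and scraped at most once."""
    queue, seen, rows = deque([start_url]), {start_url}, []
    while queue and len(rows) < max_scrapes:
        url = queue.popleft()
        try:
            soup = BeautifulSoup(fetch(url), "html.parser")
        except Exception as e:
            rows.append({"url": url, "status": f"error: {e}"})
            continue
        new_links = 0
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:  # the seen-set is what prevents re-crawling
                seen.add(link)
                queue.append(link)
                new_links += 1
        rows.append({"url": url, "status": "scraped",
                     "pageTitle": soup.title.string if soup.title else None,
                     "newToQueue": new_links})
    return rows
```

Each dict in `rows` plays the role of one line of the CSV above; unlike the recursive version, a page that is linked from many places is still fetched only once.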


Is this possible for just BeautifulSoup and the Requests library?

For some sites, it might be - both of my crawlers use a linkToSoup function [to fetch and parse the html of each page], and I have several versions of it; the simple requests version didn't work for bleepingcomputer, so I used cloudscraper. However, I'm not skilled at setting headers for requests [beyond the very basics], so someone else may be able to figure out the perfect set of parameters...

Driftr95

You can do it with only Requests and BeautifulSoup. Just add the links to a list or a dict and iterate over it.

  • This isn't what I'm asking. – ninj Nov 13 '22 at 11:53
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Nov 16 '22 at 13:47