scrape the website, and what is inside the links at the same time
Assuming you mean that you want to scrape all the pages that the links on the site lead to, I have a recursive crawler that should be able to do it, called like linkTreeMapper('https://www.bleepingcomputer.com/', None, 1). The function is meant to go deeper as well; e.g.,
linkTreeMapper('https://www.bleepingcomputer.com/', pageLimit=2)
would return
{
  "url": "https://www.bleepingcomputer.com/",
  "atDepth": 0,
  "pageTitle": "BleepingComputer | Cybersecurity, Technology News and Support",
  "pageBodyText": "News Featured Latest US govt: Iranian hackers br...n in with Twitter Not a member yet? Register Now",
  "pageUrls": [
    {
      "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
      "atDepth": 1,
      "pageTitle": "US govt: Iranian hackers breached federal agency using Log4Shell exploit",
      "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT",
      "pageUrls": [
        {
          "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
          "atDepth": 2,
          "pageTitle": "US govt: Iranian hackers breached federal agency using Log4Shell exploit",
          "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT",
          "pageUrls": [
            {
              "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
              "atDepth": 3,
              "pageTitle": "US govt: Iranian hackers breached federal agency using Log4Shell exploit",
              "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT"
            },
            {
              "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
              "atDepth": 3,
              "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
              "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT"
            }
          ]
        },
        {
          "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
          "atDepth": 2,
          "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
          "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT",
          "pageUrls": [
            {
              "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
              "atDepth": 3,
              "errorMessage": "Message: timeout: Timed out receiving message from renderer: -0.009\n (Session info: chrome=107.0.5304.88)\n"
            },
            {
              "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
              "atDepth": 3,
              "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
              "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT"
            }
          ]
        }
      ]
    },
    {
      "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
      "atDepth": 1,
      "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
      "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT",
      "pageUrls": [
        {
          "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
          "atDepth": 2,
          "pageTitle": "US govt: Iranian hackers breached federal agency using Log4Shell exploit",
          "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT",
          "pageUrls": [
            {
              "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
              "atDepth": 3,
              "pageTitle": "US govt: Iranian hackers breached federal agency using Log4Shell exploit",
              "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT"
            },
            {
              "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
              "atDepth": 3,
              "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
              "pageBodyText": "News Featured Latest US govt: Iranian hackers br...SUBMIT advertisement advertisement advertisement"
            }
          ]
        },
        {
          "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
          "atDepth": 2,
          "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
          "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT",
          "pageUrls": [
            {
              "url": "https://www.bleepingcomputer.com/news/security/us-govt-iranian-hackers-breached-federal-agency-using-log4shell-exploit/",
              "atDepth": 3,
              "pageTitle": "US govt: Iranian hackers breached federal agency using Log4Shell exploit",
              "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT"
            },
            {
              "url": "https://www.bleepingcomputer.com/news/security/google-to-roll-out-privacy-sandbox-on-android-13-starting-early-2023/",
              "atDepth": 3,
              "pageTitle": "Google to roll out Privacy Sandbox on Android 13 starting early 2023",
              "pageBodyText": "News Featured Latest US govt: Iranian hackers br...what content is prohibited. Submitting... SUBMIT"
            }
          ]
        }
      ]
    }
  ]
}
(I set pageLimit=2 so that the output would be small enough to view here.)
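For reference, the sketch below shows the general shape a linkTreeMapper-style recursive crawler can take. It is an illustration using plain requests + BeautifulSoup, not my exact implementation (which uses a linkToSoup helper), and the parameter names here are hypothetical:

```python
# Illustrative sketch only - NOT the actual linkTreeMapper; names are hypothetical.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def link_tree_mapper(url, max_depth=1, depth=0):
    node = {'url': url, 'atDepth': depth}
    try:
        resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
        resp.raise_for_status()
    except Exception as e:
        node['errorMessage'] = str(e)  # mirrors the "errorMessage" key in the output above
        return node
    soup = BeautifulSoup(resp.content, 'html.parser')
    node['pageTitle'] = soup.title.get_text(strip=True) if soup.title else None
    node['pageBodyText'] = soup.body.get_text(' ', strip=True) if soup.body else None
    if depth < max_depth:
        # resolve relative hrefs against the current page, then recurse one level deeper
        links = [urljoin(url, a['href']) for a in soup.select('a[href]')]
        node['pageUrls'] = [link_tree_mapper(u, max_depth, depth + 1) for u in links]
    return node
```

Note that a sketch like this will happily re-scrape the same URL at every depth, which is exactly the duplication problem visible in the output above.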
However, recursion can be risky in cases like this (deep link trees can blow past the recursion limit), and, as you can see, it is harder to eliminate repeated crawling of the same page; so it might be better to use this queue-based crawler:
## FIRST COPY [OR DOWNLOAD&IMPORT] REQUIREMENTS FROM https://pastebin.com/TBtYja5D ##
setGlobals(varDict={
    'starterUrl': 'https://www.bleepingcomputer.com/',
    'pageLimit': None, 'maxScrapes': 55, 'scrapeCt': 0, 'curUrlId': 0
}, clearLog=True)

nextUrl = get_next_fromScrawlQ()
while nextUrl:
    nextUrl = logScrape(scrapeUrl(nextUrl))

saveScrawlSess('qScrawl_bleepingComp.csv', 'vScrawl_bleepC.json')
and "qScrawl_bleepingComp.csv" would look like
| url | refUrlId | status | urlId | pageTitle | pageText | pageUrlCt | newToQueue |
|:---|:---|:---|:---|:---|:---|:---|:---|
| https://www.bleepingcomputer.com/ | [starter] | scraped | 1 | BleepingComputer \| Cybersecurity, Technology News and Support | News Featured Latest Exploit rele...er Not a member yet? Register Now | 237.0 | 129.0 |
| https://www.facebook.com/BleepingComputer | 1 | ?scraped | 2 | BleepingComputer \| New York NY | NaN | 0.0 | 0.0 |
| https://twitter.com/BleepinComputer | 1 | scraped | 3 | NaN | JavaScript is not available. We’v...ret — let’s give it another shot. | 6.0 | 6.0 |
| https://www.youtube.com/user/BleepingComputer | 1 | ?scraped | 4 | YouTube | BleepingComputer - YouTube | 0.0 | 0.0 |
| https://www.bleepingcomputer.com/news/security/exploit-released-for-actively-abused-proxynotshell-exchange-bug/ | 1 | scraped | 5 | Exploit released for actively abused ProxyNotShell Exchange bug | News Featured Latest Exploit rele... prohibited. Submitting... SUBMIT | 148.0 | 26.0 |
(Only the first five rows [of 1.6k+] are shown above, and the last 2 columns have not been included - see the uploaded CSV for the full output.)
The "pageUrlIds" column holds a list of numbers, each of which corresponds to a row via the "urlId" column; so, if you wanted to, you could use it to build a nested dictionary like the output of the recursive crawler.
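As a rough sketch of that idea (hypothetical code: it assumes the table has been loaded into a pandas DataFrame and that the "pageUrlIds" cells have already been parsed into actual Python lists of urlId values):

```python
# Hypothetical sketch: rebuild the nested-dict form from the flat crawl table.
# Assumes 'pageUrlIds' cells hold Python lists of urlId values (after parsing).
import pandas as pd

def build_tree(df, url_id, depth=0, max_depth=2):
    row = df.loc[df['urlId'] == url_id].iloc[0]  # the row this urlId points at
    node = {'url': row['url'], 'atDepth': depth,
            'pageTitle': row['pageTitle'], 'pageBodyText': row['pageText']}
    kids = row['pageUrlIds'] if isinstance(row['pageUrlIds'], list) else []
    if depth < max_depth and kids:
        node['pageUrls'] = [build_tree(df, k, depth + 1, max_depth) for k in kids]
    return node
```

The max_depth cap matters here: since pages link back to each other, following pageUrlIds without a limit would recurse forever.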
Is this possible with just BeautifulSoup and the Requests library?
For some sites, it might be - both of my crawlers use a linkToSoup function [to fetch and parse the HTML of each page], and I have several versions of it; the simple requests version didn't work for bleepingcomputer, so I used cloudscraper. However, I'm not skilled at setting headers for requests [beyond the very basics], so someone else may be able to figure out the perfect set of parameters...
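For the curious, a stripped-down sketch of the general shape of such a helper (a hypothetical simplification, not my actual linkToSoup): try plain requests first and fall back to cloudscraper when the site refuses the request:

```python
# Simplified, hypothetical linkToSoup-style helper (not the actual version):
# plain requests first, cloudscraper fallback for sites that block it.
import requests
from bs4 import BeautifulSoup

def link_to_soup(url, timeout=10):
    try:
        resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=timeout)
        resp.raise_for_status()
    except Exception:
        import cloudscraper  # pip install cloudscraper
        resp = cloudscraper.create_scraper().get(url, timeout=timeout)
        resp.raise_for_status()
    return BeautifulSoup(resp.content, 'html.parser')
```

cloudscraper's create_scraper() returns a requests-like session, so swapping it in for requests.get is essentially a one-line change.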