I'm not exactly familiar with linkGrabber, but in terms of BS4 (and this can be accomplished with bs4 alone):
from bs4 import BeautifulSoup
import requests
import re

soup = BeautifulSoup(
    requests.get('https://www.google.com/search?source=hp&ei=_rNeXtXEHLWwytMPhJmz0Aw&q=intext%3Atestfile.exe').content,
    'html.parser'
)
with open('urls.txt', 'w') as f:
    # Google wraps each result link as /url?q=<real url>
    for link in soup.find_all(name='a', attrs={'href': re.compile(r'/url\?q=')}):
        # slice off the /url?q= prefix and write one URL per line
        f.write(link.attrs['href'][len('/url?q='):] + '\n')
Produces the following:
https://kc.mcafee.com/corporate/index%3Fpage%3Dcontent%26id%3DKB90863&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAAegQIARAB&usg=AOvVaw2rsfAwi5ERbRHQ81bnwVEj
https://community.mcafee.com/t5/Endpoint-Security-ENS/McAfee-ATP-RP-S-TestFile-exe-ID-5/td-p/623667&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjABegQICRAB&usg=AOvVaw0rFtkZ9BUM8rWrBnvrfVqK
https://www.hybrid-analysis.com/sample/56afd27f2010b63ed00d8db0034833a1dc63bd3dae41c2555e2669e445815d41%3FenvironmentId%3D100&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjACegQIBxAB&usg=AOvVaw3Lv0GrbOT-c0TY60n5XNKY
https://hybrid-analysis.com/sample/2ef68884f5b59c6ff4240e6e61e1583fe77cc28a4494bfb2e7a395b31bc49e91%3FenvironmentId%3D100&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjADegQICBAB&usg=AOvVaw0wREqjYkINqxsESIhanSFV
https://www.joesandbox.com/analysis/45105/0/pdf&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAEegQIBhAB&usg=AOvVaw223qTN8J8GwX4BiyrwJhwW
http://helpserver.biz/onlinehelp/lpmme/7.0/generator/help2000/exe-flash_application_using_exe_fi.html&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAFegQIBBAB&usg=AOvVaw03DjaCaUcaKWe3ygGSbWI1
https://stackoverflow.com/a/54102103&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAGegQIAxAB&usg=AOvVaw0L9O3uHmsRM9FORU3yYVmf
https://wiki.itarian.com/frontend/web/topic/download-pdf/ccs-profile-paths-rules-and-special-symbol-use&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAHegQIBRAB&usg=AOvVaw20Y_3a7Ewk2wmWN6uscOzi
https://www.ccleaner.com/docs/defraggler/advanced-usage/command-line-parameters&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAIegQIABAB&usg=AOvVaw2BSd_sIyvQFxtaerNOmNSJ
https://www.registry-programs.com/process/list/testfile.exe.html&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAJegQIAhAB&usg=AOvVaw1UNdwgn-QTqhYQ8m_MpI5j
https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fsource%253Dhp%2526ei%253D_rNeXtXEHLWwytMPhJmz0Aw%2526q%253Dintext:testfile.exe%26hl%3Den&sa=U&ved=0ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8Qxs8CCC0&usg=AOvVaw0XtHw3wxFlE0uI5pIgV9Zq
I don't know where that last link came from, but if it's consistent it's pretty simple to check for. Maybe linkGrabber won't return that one. You could look explicitly for a match to that link type, or just skip URLs that return a 404; you can check the status with requests.get.
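As a minimal sketch of that 404 check (the is_live helper name and the 5-second timeout are my additions, not anything from linkGrabber):

```python
import requests

def is_live(url, timeout=5):
    """Return True unless the URL answers with a 404.
    Network errors are treated as dead links."""
    try:
        return requests.get(url, timeout=timeout).status_code != 404
    except requests.RequestException:
        return False
```

Inside the loop you'd then only write the URL when is_live(url) is true. Note this makes one extra request per result, so it will slow the scrape down.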
As for an explanation, skipping the import statements:
- Grab the HTML for the link OP provided and parse it into a BeautifulSoup object.
- Use a context manager to open the output file; that way you don't have to explicitly close it, and it's nice and pythonic.
- Iterate over all the a tags in the soup with an href attribute that matches the provided regex string. Note I changed that regex string to match what was actually in the HTML: Google must prepend search-result links with that substring. Also, including the https part in the pattern omits results which are not HTTPS-secured. Here's an example of the actual HTML from the search provided:
<a href="/url?q=https://www.registry-programs.com/process/list/testfile.exe.html&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAJegQIAhAB&usg=AOvVaw1UNdwgn-QTqhYQ8m_MpI5j"><div class="BNeawe vvjwJb AP7Wnd">Is it testfile.exe a virus? How to fix testfile.exe file error?</div><div class="BNeawe UPmit AP7Wnd">https://www.registry-programs.com › process › list › testfile.exe.html</div></a>
- Then we use a Tag's .attrs dictionary to access its attributes, remove that weird string Google adds from the href, and write the result to the open file.
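One pitfall worth flagging with that prefix removal: str.lstrip takes a *set* of characters to strip, not a literal prefix, so lstrip('/url?q=') only happens to work because 'h' (from 'https') isn't in that set. A quick sketch, using one of the real hrefs above plus a made-up shortener URL to show where lstrip over-strips:

```python
href = '/url?q=https://www.registry-programs.com/process/list/testfile.exe.html&sa=U'

# lstrip removes any leading characters found in the set '/url?q=';
# it stops at 'h' here, so this case looks fine
print(href.lstrip('/url?q='))

# but a URL starting with characters from that set gets over-stripped
print('/url?q=url-shortener.example/x'.lstrip('/url?q='))  # loses 'url' too

# slicing (or str.removeprefix on Python 3.9+) removes exactly the prefix
print(href[len('/url?q='):])
```

That's why slicing off len('/url?q=') characters is the safer move.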
Final note: I suspect the issue you are having is with the regex string you provided to find_all; try it with the one from my solution.