
I am trying to perform a Google web search and then extract all the URLs from the results, but I am stuck. Here is the web search: "intext:testfile.exe"

https://www.google.com/search?source=hp&ei=_rNeXtXEHLWwytMPhJmz0Aw&q=intext%3Atestfile.exe

Here is the Python code that I have so far:

import re
import linkGrabber

links = linkGrabber.Links('https://www.google.com/search?source=hp&ei=_rNeXtXEHLWwytMPhJmz0Aw&q=intext%3Atestfile.exe')
fo = open("URLs.txt", "w")  # open once, outside the loop, so each link is appended
for x in links.find('a', attrs={'href': re.compile("^https://")}, duplicates=False):
    print(x.get('href'))
    fo.write(x.get('href') + "\n")
fo.close()

1 Answer


I'm not exactly familiar with linkGrabber, but this can be accomplished with just BS4 (BeautifulSoup):

from bs4 import BeautifulSoup
import requests
import re

# Fetch the search page; passing an explicit parser avoids a bs4 warning.
soup = BeautifulSoup(requests.get('https://www.google.com/search?source=hp&ei=_rNeXtXEHLWwytMPhJmz0Aw&q=intext%3Atestfile.exe').content, 'html.parser')
with open('urls.txt', 'w') as f:
    for link in soup.find_all(name='a', attrs={'href': re.compile(r'/url\?q=')}):
        # str.lstrip strips a character set, not a prefix, so slice the
        # fixed-length '/url?q=' prefix off instead.
        f.write(link.attrs['href'][len('/url?q='):] + '\n')

Produces the following:

https://kc.mcafee.com/corporate/index%3Fpage%3Dcontent%26id%3DKB90863&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAAegQIARAB&usg=AOvVaw2rsfAwi5ERbRHQ81bnwVEj
https://community.mcafee.com/t5/Endpoint-Security-ENS/McAfee-ATP-RP-S-TestFile-exe-ID-5/td-p/623667&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjABegQICRAB&usg=AOvVaw0rFtkZ9BUM8rWrBnvrfVqK
https://www.hybrid-analysis.com/sample/56afd27f2010b63ed00d8db0034833a1dc63bd3dae41c2555e2669e445815d41%3FenvironmentId%3D100&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjACegQIBxAB&usg=AOvVaw3Lv0GrbOT-c0TY60n5XNKY
https://hybrid-analysis.com/sample/2ef68884f5b59c6ff4240e6e61e1583fe77cc28a4494bfb2e7a395b31bc49e91%3FenvironmentId%3D100&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjADegQICBAB&usg=AOvVaw0wREqjYkINqxsESIhanSFV
https://www.joesandbox.com/analysis/45105/0/pdf&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAEegQIBhAB&usg=AOvVaw223qTN8J8GwX4BiyrwJhwW
http://helpserver.biz/onlinehelp/lpmme/7.0/generator/help2000/exe-flash_application_using_exe_fi.html&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAFegQIBBAB&usg=AOvVaw03DjaCaUcaKWe3ygGSbWI1
https://stackoverflow.com/a/54102103&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAGegQIAxAB&usg=AOvVaw0L9O3uHmsRM9FORU3yYVmf
https://wiki.itarian.com/frontend/web/topic/download-pdf/ccs-profile-paths-rules-and-special-symbol-use&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAHegQIBRAB&usg=AOvVaw20Y_3a7Ewk2wmWN6uscOzi
https://www.ccleaner.com/docs/defraggler/advanced-usage/command-line-parameters&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAIegQIABAB&usg=AOvVaw2BSd_sIyvQFxtaerNOmNSJ
https://www.registry-programs.com/process/list/testfile.exe.html&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAJegQIAhAB&usg=AOvVaw1UNdwgn-QTqhYQ8m_MpI5j
https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fsource%253Dhp%2526ei%253D_rNeXtXEHLWwytMPhJmz0Aw%2526q%253Dintext:testfile.exe%26hl%3Den&sa=U&ved=0ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8Qxs8CCC0&usg=AOvVaw0XtHw3wxFlE0uI5pIgV9Zq

I don't know where that last link comes from, but if it's consistent it's pretty simple to check for; maybe linkGrabber won't return it. You could look explicitly for a match to that link type, or just not write URLs that return 404 to your file; you can check that with requests.get.
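
For example, here is a minimal sketch of that last suggestion, filtering out dead links before keeping them. The is_live helper and the timeout value are my own choices, not part of the answer above; requests.get and Response.status_code are standard requests APIs:

import requests

def is_live(url):
    # Treat a 404 response, or any network error, as a dead link.
    try:
        return requests.get(url, timeout=10).status_code != 404
    except requests.RequestException:
        return False

live_urls = [u for u in urls if is_live(u)]  # 'urls' is whatever list you scraped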

As for an explanation, skipping the import statements:

  • Grab the HTML for the link the OP provided, and parse it with BeautifulSoup into a BeautifulSoup object.
  • Use a context manager to open the output file; this means you don't have to explicitly close it, and it's nice and Pythonic.
  • Iterate over all the a tags in the soup whose href attribute matches the provided regex string. Note that I changed the regex string based on what is actually in the HTML; Google prepends search-result hrefs with that /url?q= substring. Also, requiring the https:// part would omit results that are not served over HTTPS. Here's an example of the actual HTML from the search provided:
<a href="/url?q=https://www.registry-programs.com/process/list/testfile.exe.html&amp;sa=U&amp;ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAJegQIAhAB&amp;usg=AOvVaw1UNdwgn-QTqhYQ8m_MpI5j"><div class="BNeawe vvjwJb AP7Wnd">Is it testfile.exe a virus? How to fix testfile.exe file error?</div><div class="BNeawe UPmit AP7Wnd">https://www.registry-programs.com › process › list › testfile.exe.html</div></a>
  • Then use the Tag's .attrs to access its attributes as a dictionary, remove that /url?q= string Google adds, and write the result to the open file (a sketch for cleaning these hrefs a bit further follows this list).
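
As the output above shows, the hrefs Google returns also carry percent-encoded characters (%3F, %3D) and trailing tracking parameters (&sa=..., &ved=..., &usg=...). If you want clean destination URLs, something like the following should work; splitting at '&sa=' is an assumption about Google's current markup, and urllib.parse.unquote is standard library:

from urllib.parse import unquote

def clean_google_href(href):
    # Drop the '/url?q=' redirect prefix, cut Google's tracking parameters,
    # and decode percent-escapes such as %3F -> '?'.
    href = href[len('/url?q='):]
    href = href.split('&sa=')[0]  # assumption: tracking params begin at '&sa='
    return unquote(href)

print(clean_google_href('/url?q=https://www.joesandbox.com/analysis/45105/0/pdf&sa=U&ved=2ahUKEwiS7JLniv_nAhWmyYsBHVKeCG8QFjAEegQIBhAB&usg=AOvVaw223qTN8J8GwX4BiyrwJhwW'))
# -> https://www.joesandbox.com/analysis/45105/0/pdf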

Final note: I suspect the issue you are having is with the regex string you provided to find; try it with the one from my solution.

  • Thank you for your detailed response; however, it looks like not all of the suggested URLs are being returned, only about 11. – antmar904 Mar 03 '20 at 20:55
  • That is correct. Even using linkGrabber (I just installed it and tried it), when scraping HTML you only get one page of results at a time. There are exactly ten results per page in a Google search; the eleventh is that pesky one I mentioned. (Side note: linkGrabber returns extra links that don't seem relevant.) If you expect to scrape every single result from a Google search using linkGrabber or straight bs4, you'll have to iterate over each page of search results, as in the sketch after these comments. – R. Arctor Mar 03 '20 at 21:15
  • Maybe this would be helpful: https://github.com/MarioVilas/googlesearch. Worth noting that this sort of scraping may be against Google's terms of service, but I'm not positive. – R. Arctor Mar 03 '20 at 21:18
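
For completeness, a rough sketch of that pagination using Google's start query parameter (ten results per page, per the comment above; the three-page limit is an arbitrary choice for the sketch, and as noted, this kind of scraping may violate Google's terms of service):

import re
import requests
from bs4 import BeautifulSoup

all_links = []
for page in range(3):  # arbitrary page limit for the sketch
    # Google paginates with 'start': 0 for page 1, 10 for page 2, and so on.
    resp = requests.get('https://www.google.com/search',
                        params={'q': 'intext:testfile.exe', 'start': page * 10})
    soup = BeautifulSoup(resp.content, 'html.parser')
    for a in soup.find_all('a', attrs={'href': re.compile(r'/url\?q=')}):
        all_links.append(a['href'][len('/url?q='):])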