
How do I get as output the list of LINKS only? I have tried other solutions with both BeautifulSoup and Selenium, but they still give me a result very similar to the one I am currently getting, which is the href of the link AND the anchor text. I tried to use urlparse as some older answers suggested, but it seems that module is not in use anymore and I am confused about the whole thing. This is my code, currently outputting the link AND the anchor text, which is NOT what I want:

import requests, re
from bs4 import BeautifulSoup
headers = {'User-agent':'Mozilla/5.0'}
page = requests.get('https://www.google.com/search?q=Tesla',headers=headers)
soup = BeautifulSoup(page.content,'lxml')
global serpUrls  # note: global is a no-op at module level
serpUrls = []
links = soup.findAll('a')  # unused; the filtered find_all below does the work
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    #print(re.split(":(?=http)", link["href"].replace("/url?q=", "")))
    serpUrls.append(link)  # appends the whole <a> tag, not just the href

print(serpUrls[0:2])

xmasRegex = re.compile(r"""((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))*))+(?:(([^\s()<>]+|(([^\s()<>]+)))*)|[^\s`!()[]{};:'".,<>?«»“”‘’]))""", re.DOTALL)
mo = xmasRegex.findall('[<a href="/url?q=https://www.teslamotors.com/&amp;sa=U&amp;ved=0ahUKEwjvzrTyxvTKAhXHWRoKHUjlBxwQFggUMAA&amp;usg=AFQjCNG1nvN_Z0knKTtEah3whTIObUAhcg"><b>Tesla</b> Motors | Premium Electric Vehicles</a>, <a class="_Zkb" href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:rzPQodkDKYYJ:https://www.teslamotors.com/%252BTesla%26gws_rd%3Dcr%26hl%3Des%26%26ct%3Dclnk&amp;sa=U&amp;ved=0ahUKEwjvzrTyxvTKAhXHWRoKHUjlBxwQIAgXMAA&amp;usg=AFQjCNEZ40VWO_fFDjXH09GakUOgODNlHg">En caché</a>]')
print(mo)

I only want the "http://urloflink.com" part, not the whole <a> tag. Any way to do this? Thanks!

Output looks like this:

[<a href="/url?q=https://www.teslamotors.com/&amp;sa=U&amp;ved=0ahUKEwjI39vl2_TKAhXFWxoKHRX-CFgQFggUMAA&amp;usg=AFQjCNG1nvN_Z0knKTtEah3whTIObUAhcg"><b>Tesla</b> Motors | Premium Electric Vehicles</a>, <a class="_Zkb" href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:rzPQodkDKYYJ:https://www.teslamotors.com/%252BTesla%26gws_rd%3Dcr%26hl%3Des%26%26ct%3Dclnk&amp;sa=U&amp;ved=0ahUKEwjI39vl2_TKAhXFWxoKHRX-CFgQIAgXMAA&amp;usg=AFQjCNEZ40VWO_fFDjXH09GakUOgODNlHg">En caché</a>]
[('https://www.teslamotors.com/&amp;sa=U&amp;ved=0ahUKEwjvzrTyxvTKAhXHWRoKHUjlBxwQFggUMAA&amp;usg=AFQjCNG1nvN_Z0knKTtEah3whTIObUAhcg"', '', '', '', '', '', '', '', ''), ('http://webcache.googleusercontent.com/search%3Fq%3Dcache:rzPQodkDKYYJ:https://www.teslamotors.com/%252BTesla%26gws_rd%3Dcr%26hl%3Des%26%26ct%3Dclnk&amp;sa=U&amp;ved=0ahUKEwjvzrTyxvTKAhXHWRoKHUjlBxwQIAgXMAA&amp;usg=AFQjCNEZ40VWO_fFDjXH09GakUOgODNlHg"', '', '', '', '', '', '', '', '')]
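
Edit: from the comments, it seems urlparse is now urllib.parse in Python 3, which may be why I couldn't find it. A minimal sketch of the kind of extraction I'm after (the wrapped value is shortened from the output above, just for illustration):

from urllib.parse import urlparse, parse_qs

# hrefs scraped above look like "/url?q=<real url>&sa=U&ved=...&usg=..."
wrapped = "/url?q=https://www.teslamotors.com/&sa=U&usg=AFQjCNG1nvN"

# the real destination sits in the "q" query parameter
real_url = parse_qs(urlparse(wrapped).query)["q"][0]
print(real_url)  # https://www.teslamotors.com/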
skeitel
  • Are you still [using regex to parse html?](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – zondo Feb 13 '16 at 12:18
  • I am a newbie, so I am using what I guessed was the best solution, but I suspect it is not; that's why I am asking. I am sure there is a better way or some module that will do it more easily. I tried installing the GoogleScraper module, but for some reason neither PyCharm nor pip could install it on my computer. – skeitel Feb 13 '16 at 12:48
  • I also tried this, and it did not get me what I need either: results = driver.find_elements_by_css_selector('div.g'); link = results[0].find_element_by_tag_name("a"); href = link.get_attribute("href") – skeitel Feb 13 '16 at 12:50
  • Have you taken a look at [urllib](https://docs.python.org/2/library/urllib.html)? – zondo Feb 13 '16 at 13:33
  • I did but I heard somewhere "don't use urllib in the future. It's more complicated and slower than requests, so use requests". That's why I tried the Selenium/Requests route first. Maybe I am missing something. – skeitel Feb 14 '16 at 14:24

2 Answers


You're looking for this; no Selenium required (CSS selectors reference):

# container with needed data, e.g. title, link, snippet, displayed link
for result in soup.select('.tF2Cxc'):

  # grab only link from the container
  link = result.select_one('.yuRUbf a')['href']

Have a look at the SelectorsGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.


Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
  "q": "tesla",   # query
  "gl": "us",     # country to search from
  "hl": "en",     # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']
  print(link)

---------
'''
https://www.tesla.com/
https://en.wikipedia.org/wiki/Tesla,_Inc.
https://en.wikipedia.org/wiki/Nikola_Tesla
'''
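
If you also want the title or snippet, the same container can be queried. Note that .DKV0Md and .VwiC3b below are assumed class names for the title and snippet; Google's markup changes often, so verify them with SelectorGadget first:

for result in soup.select('.tF2Cxc'):
  title = result.select_one('.DKV0Md').text      # assumed title selector
  snippet = result.select_one('.VwiC3b').text    # assumed snippet selector
  link = result.select_one('.yuRUbf a')['href']
  print(title, link, snippet, sep='\n')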

Alternatively, you can achieve the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to figure out selectors or parse the HTML yourself, since that's already done for the end user; instead, you only need to iterate over structured JSON and pick out the data you want.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "tesla",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(result['link'])

---------
'''
https://www.tesla.com/
https://en.wikipedia.org/wiki/Tesla,_Inc.
https://en.wikipedia.org/wiki/Nikola_Tesla
'''

P.S. I wrote a blog post about how to reduce the chance of being blocked while web scraping search engines.

Disclaimer: I work for SerpApi.

Dmitriy Zub

Never, ever, use regexp to parse HTML.

If you do the find_all properly, you should be able to access the href attribute on each result.
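
For example, a minimal sketch that reuses the question's own find_all call and reads the href attribute instead of appending the whole tag:

import requests, re
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0'}
page = requests.get('https://www.google.com/search?q=Tesla', headers=headers)
soup = BeautifulSoup(page.content, 'lxml')

serpUrls = []
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    # link["href"] is just the attribute value, e.g. "/url?q=https://..."
    serpUrls.append(link["href"])

print(serpUrls[0:2])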

sorin
  • What kind of arrogant and negative answer is this? "Properly"? If you're going to mock my skills you could at least provide a solution. Like I said, I am new to this. – skeitel Feb 14 '16 at 14:23
  • Does yours and similar comments mean that this is wrong? https://www.youtube.com/watch?v=GEshegZzt3M – skeitel Feb 14 '16 at 15:08