-1

I want to scrape Google. I have successfully scraped Craigslist using this method, but I can't seem to scrape Google for some reason (yes, of course I changed the class name and so on). This is what I want to scrape:

I want to scrape each website's description:

image

from selenium import webdriver

path = r"C:\Users\Skid\Desktop\chromedriver.exe"

driver = webdriver.Chrome(path)

driver.get("https://www.google.com/#q=python+webscape+google")

posts = driver.find_elements_by_class_name("r")
for post in posts:
    print(post.text)
Termininja
Ghost
  • Probably google detected you as a bot. Try dumping the scraped web page, might have some clue why it's not working. – Wreeecks Jan 16 '16 at 16:39
  • Watcha mean try dumping the scraped webpage? I've scraped craigslist and it worked, give me an example? – Ghost Jan 16 '16 at 16:41
  • @KevinGuan's answer is correct. Just correct your url. Instead of `"#q="` it should be `"?q="` – JRodDynamite Jan 16 '16 at 16:52
  • @Ghost I mean can you dump the html that you need to parse? – Wreeecks Jan 16 '16 at 16:54
  • @bwaaaaaa: It's in the code. – Remi Guan Jan 16 '16 at 16:55
  • @KevinGuan i know that it's in the code. what i'm saying is, Is there a way to see the html data that he needs to be parsed? Just making sure he is scraping the correct page. – Wreeecks Jan 16 '16 at 16:59
  • @bwaaaaaa: Understood. Are you saying that the image's link and the code's link don't look the same? – Remi Guan Jan 16 '16 at 17:00
  • @KevinGuan what do you get from this code -> `driver.get("https://www.google.com/#q=python+webscape+google")`? – Wreeecks Jan 16 '16 at 17:02
  • @bwaaaaaa: Yeah I know, and the image is `https://www.google.com/#q=Stack+Overflow` right? – Remi Guan Jan 16 '16 at 17:03
  • i mean the HTML of it. not the image. he's looking for an html tag right? – Wreeecks Jan 16 '16 at 17:07
  • @bwaaaaaa: Do you mean, the *real HTML source*, not only the URL? – Remi Guan Jan 16 '16 at 17:11
  • @KevinGuan yes exactly, the HTML code of `https://www.google.com/#q=Stack+Overflow`. – Wreeecks Jan 16 '16 at 17:15
  • @bwaaaaaa: Ah, fine. – Remi Guan Jan 16 '16 at 17:16
  • @KevinGuan so you got any Idea? Can you play around with something for me ? <3 – Ghost Jan 16 '16 at 19:17
  • 1. You're not supposed to scrape Google; it's against their terms of service. You are probably running into their protections against automated access, which could be any of several things, for example CAPTCHA, IP blocking, etc. 2. Just because it worked for Craigslist doesn't mean it will work for Google. The internet doesn't work like that. All sites are different, and Google especially takes major precautions to protect its service. 3. I see many references to `#q`, but my URL uses `?q`. This was already mentioned, but you don't say whether you tried it. – Eugene van der Merwe Apr 19 '19 at 08:47
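
Two suggestions from the comments above (put the query in `?q=` rather than `#q=`, and dump the HTML the browser actually received) can be sketched roughly as follows. This is only an illustration: `build_search_url` and `dump_page_source` are hypothetical helper names, the output file name is arbitrary, and `driver` is assumed to be an already created Selenium WebDriver such as the `webdriver.Chrome(path)` object from the question.

```python
from pathlib import Path
from urllib.parse import urlencode


def build_search_url(query):
    """Return a Google search URL that carries the query in the query string (?q=)."""
    # urlencode turns spaces into "+" and escapes other special characters.
    return "https://www.google.com/search?" + urlencode({"q": query})


def dump_page_source(driver, url, out_path):
    """Load `url` in an existing Selenium driver and save the HTML to `out_path`."""
    driver.get(url)
    html = driver.page_source  # HTML of the page as the browser currently holds it
    Path(out_path).write_text(html, encoding="utf-8")
    return html
```

Calling `dump_page_source(driver, build_search_url("python webscrape google"), "google_dump.html")` and opening the saved file in an editor should show whether the page contains search results, a CAPTCHA, or a consent page, and which class names the results really use.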

2 Answers

0

Solved. Add a delay (`import time`, then `time.sleep(2)`) after loading the page and before scraping.
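
A minimal sketch of that fix, assuming the same class name `"r"` as in the question (Google's markup changes often, so it may no longer match anything). `scrape_result_text` is a hypothetical helper name, and `driver` is assumed to be an existing Selenium WebDriver. The lookup uses the generic `find_elements(by, value)` call with the locator string `"class name"`, which is the value of Selenium's `By.CLASS_NAME`.

```python
import time


def scrape_result_text(driver, url, class_name="r", delay=2.0):
    """Load `url`, pause for `delay` seconds, then return the text of matching elements."""
    driver.get(url)
    time.sleep(delay)  # fixed pause so the results have a chance to appear before the lookup
    elements = driver.find_elements("class name", class_name)
    return [element.text for element in elements]
```

With the driver from the question, this would be called as `scrape_result_text(driver, "https://www.google.com/search?q=python+webscrape+google")`. A fixed sleep is the simplest option; it wastes time when the page loads quickly and can still be too short when it loads slowly.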

Ghost
  • If you need to scrape larger amounts of results you can't use selenium anymore. It should work fine for lower amounts. You can take a look here to get deeper into the topic: http://google-scraper.squabbel.com/ P.S. you can mark your question as resolved by accepting your own answer – John Jan 04 '17 at 22:29
  • @john I took a look at that PHP scraper. Perhaps it worked ten years ago, but it doesn't now. Now it's a way of drawing eyeballs to that page and to a paid service that evolved from the original code. Everywhere on Stack where people mention Google and scraping, you reply with a link to that PHP script. Are you sure you are not the proprietor of this code? – Eugene van der Merwe Apr 19 '19 at 08:50
  • @EugenevanderMerwe I'm using it. Definitely not since 10 years but since a few. I am updating it myself from time to time, usually it's a few characters that need a change once a year. I'm sending in most of the fixes by e-mail and they sometimes get reflected in the website code. You can do the same – John Apr 29 '19 at 17:14
0

You can scrape the website descriptions from Google Search results using the BeautifulSoup web scraping library.

More about what CSS selectors are, and the cons of using CSS selectors.

Check code in online IDE.

from bs4 import BeautifulSoup
import requests, lxml, json

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls

# these URL params are taken from the actual Google search URL
# and rewritten in a more readable format
params = {
  "q": "python web scrape google",            # query
  "gl": "us",                                 # country to search from
  "hl": "en",                                 # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

website_description_data = []

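for result in soup.select(".tF2Cxc"):
  link = result.select_one(".yuRUbf a")
  snippet = result.select_one(".lEBKkf")
  if link is None or snippet is None:
    continue  # skip result blocks that lack a link or a description

  website_description_data.append({
    "website_name": link["href"],    # despite the key name, this is the result URL
    "description": snippet.text
  })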

# print once, after the loop has collected every result
print(json.dumps(website_description_data, indent=2))

Example output

[
  {
    "website_name": "https://practicaldatascience.co.uk/data-science/how-to-scrape-google-search-results-using-python",
    "description": "Mar 13, 2021 \u2014 First, we're using urllib.parse.quote_plus() to URL encode our search query. This will add + characters where spaces sit and ensure that the\u00a0..."
  },
  {
    "website_name": "https://stackoverflow.com/questions/38619478/google-search-web-scraping-with-python",
    "description": "You can always directly scrape Google results. To do this, you can use the URL https://google.com/search?q=<Query> this will return the top\u00a0..."
  }
  # ...
]
Denis Skopa