1

I'm using a Python script to scrape Google, and this is what I get when the script finishes. Imagine I have 100 results (I'm showing 2 as an example).

{'query_num_results_total': 'Око 64 резултата (0,54 секунде/и)\xa0', 'query_num_results_page': 77, 'query_page_number': 1, 'query': 'example', 'serp_rank': 1, 'serp_type': 'results', 'serp_url': 'example2.com', 'serp_rating': None, 'serp_title': '', 'serp_domain': 'example2.com', 'serp_visible_link': 'example2.com', 'serp_snippet': '', 'serp_sitelinks': None, 'screenshot': ''}
{'query_num_results_total': 'Око 64 резултата (0,54 секунде/и)\xa0', 'query_num_results_page': 77, 'query_page_number': 1, 'query': 'example', 'serp_rank': 2, 'serp_type': 'results', 'serp_url': 'example.com', 'serp_rating': None, 'serp_title': 'example', 'serp_domain': 'example.com', 'serp_visible_link': 'example.com', 'serp_snippet': '', 'serp_sitelinks': None, 'screenshot': ''}

This is the code that uses the script:

import serpscrap
import pprint
import sys

config = serpscrap.Config()
config_new = {
   'cachedir': '/tmp/.serpscrap/',
   'clean_cache_after': 24,
   'sel_browser': 'chrome',
   'chrome_headless': True,
   'database_name': '/tmp/serpscrap',
   'do_caching': True,
   'num_pages_for_keyword': 2,
   'scrape_urls': False,
   'search_engines': ['google'],
   'google_search_url': 'https://www.google.com/search?num=100',
   'executable_path': '/usr/local/bin/chromedriver',
   'headers': {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4',
      'Accept-Encoding': 'gzip, deflate, sdch',
      'Connection': 'keep-alive',
   },
}

arr = sys.argv

keywords = ['example']

config.apply(config_new)
scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.run()


for result in results:
    print(result)

I want to stop the script if the results contain a URL I'm looking for, for example "example.com".

If a result has https, as in 'serp_url': 'https://example2.com', I still want to detect it and stop the script when I pass the argument without https, just example2.com. If it's not possible to check while the script is running, I will need an explanation of how to find a serp_url using the argument I provided.

I'm not familiar with Python, but I'm building a PHP application that will run this Python script and output the results. I don't want to process the results in PHP (extracting by serp_url, etc.); I want everything to be done in Python.

Nicolas Marek

2 Answers

0

First of all, you need to access the value of serp_url.

Since each result is a dictionary, result['serp_url'] returns that result's URL.

Inside the for loop where you print your results, add an if statement that compares result['serp_url'] with a variable containing your desired URL (I don't think your code provides it). It could look something like this:

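# my_url holds the URL to look for, e.g. my_url = sys.argv[1]
for result in results:
    print(result)
    if my_url == result['serp_url']:
        sys.exit()  # a bare `exit` does nothing; sys is already imported in the question's script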

The same idea applies to the https case, but now we need the startswith() method:

for result in results:
    print(result)
    if my_url == result['serp_url']:
        sys.exit()
    # stops on any URL that begins with 'https'
    if result['serp_url'].startswith('https'):
        sys.exit()
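To match an argument given without a scheme (example2.com) against a serp_url that may carry one (https://example2.com), one option is to compare host names instead of raw strings. The following is only a sketch: host_of and same_site are hypothetical helper names, and treating www.example.com and example.com as the same site is an assumption that may not suit every case.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercase host of a URL given with or without a scheme."""
    # urlparse only fills in netloc when a scheme or '//' is present,
    # so prepend '//' to bare values such as 'example2.com'.
    if '://' not in url:
        url = '//' + url
    host = urlparse(url).netloc.lower()
    # Assumption: 'www.example.com' and 'example.com' count as the same site.
    return host[4:] if host.startswith('www.') else host


def same_site(my_url, serp_url):
    """True when both values point at the same host, ignoring scheme and path."""
    return host_of(my_url) == host_of(serp_url)
```

Inside the loop, the comparison would then read `if same_site(my_url, result['serp_url']):`.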

Hope it helps.

  • Thank you very much, this will be useful! But I need my argument not to match exactly (==); instead, serp_url should contain my argument. If serp_url is https://example.com/ (with https://) and my argument is example.com, that statement should find a match. Can that be done? – Nicolas Marek Sep 27 '18 at 08:52
  • I didn't understand that you have more than one desired URL. In that case Tzomas's answer does the trick. – Andrew Syrmakesis Sep 27 '18 at 11:34
0

You can do it with something like this:

for result in results:
    # with my_url = 'example.com' this matches 'http://example.com',
    # 'http://example.com/whatever', and of course URLs beginning with 'https'
    if my_url in result['serp_url']:
        sys.exit()  # a bare `exit` only names the function and never stops the script

Using any() is another solution:

if any(my_url in result['serp_url'] for result in results):
    sys.exit()
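Since the script is meant to be launched from PHP, it may help to return the matching result and report the outcome through the exit status rather than just stopping. The sketch below assumes the URL arrives as the first command-line argument; find_result, main, the sample list, and the 0/1/2 status convention are all hypothetical choices, and the sample data only mimics the shape of the results shown in the question.

```python
def find_result(results, my_url):
    """Return the first result whose serp_url contains my_url, or None."""
    return next((r for r in results if my_url in r['serp_url']), None)


def main(argv, results):
    """Return an exit status: 0 = found, 1 = not found, 2 = missing argument."""
    if len(argv) < 2:
        print('usage: script.py <url>')
        return 2
    match = find_result(results, argv[1])
    if match is None:
        return 1
    # Print only what the caller needs: the rank and the matching URL.
    print(match['serp_rank'], match['serp_url'])
    return 0


# Stand-in for the list returned by scrap.run() in the question.
sample = [
    {'serp_rank': 1, 'serp_url': 'example2.com'},
    {'serp_rank': 2, 'serp_url': 'https://example.com/'},
]
```

In the question's script this could be wired up after `results = scrap.run()` with `sys.exit(main(sys.argv, results))`. Judging by the usage code, the results only become available once run() returns, so the check happens after scraping rather than during it.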
Tzomas