I have been trying to scrape the number of results within a certain date range on Google. I have done this by inserting the dates into the Google search query. However, the code I wrote returns results from outside the date range. My code is the following:

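import requests
from googlesearch import search

query='Kevin Spacey prima:14-01-2020 dopo:14-01-2020'

urls = []  # URLs that could be fetched successfully

for url in search(
    query,
    tld='it',
    lang='it',
    num=20,
    start=0,
    stop=None,
    pause=2.0
):
    try:
        # Keep the URL only if the page can actually be requested
        r = requests.get(url, timeout=None)
        r.headers
        r.status_code
        urls.append(url)
    except requests.RequestException:
        # Skip URLs that raise an HTTP or connection error
        pass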

From Google search I am getting 13 results; using my code, 39. The problem is that my results do not match those from Google. I think the problem is in the query, specifically in the date range, but I am not sure how to fix it. There may also be another error that I have not spotted yet. I hope you can tell me what I am doing wrong.

Thank you for your time and help.

Please see here the results from Google and below the outputs from my code.

https://tv.zam.it/programmi_in_tv_stasera.php
https://www.paramountnetwork.it/video/v5ln5t/film-paramount-network-gli-highlights-per-la-settimana-del-2-marzo-2020
https://www.davidemaggio.it/archives/181396/programmi-tv-di-stasera-martedi-14-gennaio-2020-su-rai2-il-film-amore-cucina-e-curry-al-posto-de-il-molo-rosso-spostato-in-seconda-serata
https://www.davidemaggio.it/archives/181401/ascolti-tv-lunedi-13-gennaio-2020
https://www.mymovies.it/film/2016/elvisnixon/pubblico/?id=778281
https://www.ilfoglio.it/siteMapVideo.jsp
http://www.starpolitics.it/author/redazione/page/2/
http://www.zorrolaleggenda.rai.it/dl/RaiTV/programmi/media/ContentItem-4acbbd88-0529-4ca5-a390-96cb38dd2317.html
https://www.lagazzettadellospettacolo.it/cinema/26473-nicholas-hoult-giurati-giffoni-film-festival-2016/
https://www.viaggiareleggeri.com/cerca/x/i
https://www.lagazzettadellospettacolo.it/musica/30431-peter-cincotti-live-italia/
https://www.viaggiareleggeri.com/cerca/x/-?ref=28250
https://www.audible.it/pd/Harry-Potter-e-il-Prigioniero-di-Azkaban-Harry-Potter-3-Audiolibri/B077HVX4WM
https://www.hfw.com/Briefings
http://www.inmediarex.it/cinema-tv/cinema-tv-recensioni/american-gods-la-serie-niente-di-cosi-divino/
http://america24.com/sitemapArticles.xml
https://www.weenjoy.net/sitemap/
https://ierioggidomaniblog.com/2017/06/02/e-arrivata-la-promo-shock-universal-su-amazon-tante-offerte-fino-al-2-luglio/
https://ierioggidomaniblog.com/2018/01/13/universal-pictures-baby-driver-barry-seal-linganno-e-madre/
https://www.glartent.com/IT/Rome/112229858801846/giovani-artisti-associati-srl
https://tubestar.it/breakingitaly
https://www.freeforumzone.com/d/1543749/Oggi-ho-visto-in-TV/discussione.aspx/18
https://mjj.freeforumzone.com/discussione.aspx?idd=662389
https://www.diariodelweb.it/tuttosu/tag/?q=4750
https://civiltascomparse.wordpress.com/category/p-greco/?ak_action=reject_mobile
https://www.ubook.com/audiobook/348309/copy-persuasivo-di-andrea-lisi
https://ipersphera.org/category/attrice/
https://www.luogocomune.net/28-opinione/4827-svezia-laboratorio-per-il-nwo
https://www.globalnpo.org/IT/Salerno/1382814642039640/La-Bottega-Di-Will
https://www.qoop.it/osvaldo-raschi-pugile?page=1
https://www.qoop.it/pugile-al-cogan?filter=lastyear
http://www.caminantes.it/page-16/index.php?categories=giornalisti
https://www.altadefinizione01.tel/10495-terminator-destino-oscuro-stream-ita.html
https://www.emailers.it/codice-sconto-del-50-cibdol-10-promozione-limitata/
https://aimatrabolmeicher.com/2014/03/03/oscar-2014-and-the-winner-is/
https://aimatrabolmeicher.com/goodbye/page/2365/
http://scandalissimi.it/home-archive.php
https://picnano.com/tags/prossimieventi
https://vilook.com/video/9E0I69VkXFc/il-lento-declino-dellitalia-qual-%C3%A8-il-vero-problema-breakingitaly-news

Total websites: 39 (including an HTTP error)

Update:

Here is the URL with all the results after customising the search:

https://www.google.co.uk/search?q=Kevin%20Spacey&lr=lang_it&cr=countryIT&hl=it&as_qdr=all&tbs=lr:lang_1it,ctr:countryIT,cdr:1,cd_min:1/14/2020,cd_max:1/14/2020&ei=WiRtXpLRH8Wb1fAPgMuTiAI&start=0&sa=N&ved=2ahUKEwiS5tj_zZroAhXFTRUIHYDlBCE4ChDy0wN6BAgEEC4

Fields that I need to look at in order to implement them in the code:

www.google.co.uk ; I would prefer to look at www.google.it
q=Kevin+spacey
lr=lang_it
cr=countryIT
hl=it
tbs=lr:lang_1it,ctr:countryIT,cdr:1,cd_min:1/14/2020,cd_max:1/14/2020
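
The fields listed above can be read out of the URL programmatically rather than by eye. The following is a minimal sketch using only the standard library; the function names `extract_search_fields` and `split_tbs` are made up for illustration, and the URL is a shortened copy of the one above with the tracking parameters (`ei`, `sa`, `ved`) left out.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the customised search URL from the question
# (tracking parameters such as ei, sa and ved are omitted here).
SEARCH_URL = (
    "https://www.google.co.uk/search?q=Kevin%20Spacey&lr=lang_it"
    "&cr=countryIT&hl=it&as_qdr=all"
    "&tbs=lr:lang_1it,ctr:countryIT,cdr:1,cd_min:1/14/2020,cd_max:1/14/2020"
    "&start=0"
)


def extract_search_fields(url):
    """Return the query-string fields of a search URL as a flat dict."""
    params = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; keep the first one.
    return {key: values[0] for key, values in params.items()}


def split_tbs(tbs):
    """Split a tbs value such as 'cdr:1,cd_min:1/14/2020' into a dict."""
    return dict(item.split(":", 1) for item in tbs.split(","))


fields = extract_search_fields(SEARCH_URL)
tbs_parts = split_tbs(fields["tbs"])

print(fields["q"])                                 # search terms
print(tbs_parts["cd_min"], tbs_parts["cd_max"])    # date range limits
```

Splitting `tbs` this way makes it easier to see that the date range lives in `cd_min` and `cd_max`, separately from the search terms in `q`.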
  • As an aside, don't use `except Exception` like that, see https://stackoverflow.com/questions/54948548/what-is-wrong-with-using-a-bare-except. – AMC Mar 14 '20 at 19:31
  • Could you please let me know whether the way I edited the code, including what you suggested, is OK? Many thanks. However, my code still returns different results (same as before) from Google's. –  Mar 14 '20 at 20:46
  • I have tried different date formats. I think the problem is there. However, I do not know how to limit the search results to between January 14th 2020 and January 14th 2020 (i.e. a single specified day). –  Mar 14 '20 at 21:36
  • Which library does `search()` come from? – Jack Fleeting Mar 14 '20 at 21:50
  • I think the parameter to include would be `tbs='sbd:1,cd_min:1/14/2020,cd_max:1/14/2020'` but I am not sure about the sbd code (1?) for searching Italian results. Does anyone know how to fix it? Could it be that it works but something is wrong in the domain/country? If you could try it and see if it works for you, it would be great. –  Mar 16 '20 at 02:09
  • @Val you have to set `tbs` and `country` parameters. Please check my answer if it's working for you. – Christos Lytras Mar 20 '20 at 20:02

1 Answer

The query that returns 13 results uses the `tbs` parameter to specify the date limits, not the inline operators `prima:14-01-2020 dopo:14-01-2020`. `googlesearch` supports `tbs`, and it even provides a helper function, `get_tbs`, which takes the start and end dates as `datetime.date` objects. You also have to set `country` to `countryIT`, as in your search URL.

The whole working script:

from googlesearch import search, get_tbs
import datetime

# query='Kevin Spacey prima:14-01-2020 dopo:14-01-2020'
query='Kevin Spacey'

urls = []
index = 0

for url in search(
    query, 
    tld='it',
    lang='it',
    country='countryIT',
    num=20,
    start=0,
    stop=None,
    pause=2.0,
    tbs=get_tbs(
        datetime.date(2020, 1, 14),
        datetime.date(2020, 1, 14))
):
    urls.append(url)
    print("%d: %s" % (index, url))
    index += 1

print("\nTotal results found: %d\n" % (len(urls)))

Will output:

0: https://www.cinematown.it/2020-01-oscar-2020-previsioni-scommesse/
1: https://www.cinematown.it/2020-01-notte-sul-pianeta-terra-trailer/
2: https://blog.italiansubs.net/critics-choice-awards-2020-i-vincitori/
3: https://www.amazon.it/Patrick-DVD/dp/B07J33SHLC
4: http://www.viraland.it/2020/01/14/cinema-e-gioco-i-migliori-film-ispirati-al-gaming/
5: https://www.altadefinizione01.tel/catalog/t/
6: https://www.altadefinizione01.tel/10495-terminator-destino-oscuro-stream-ita.html
7: https://www.sentieridelcinema.it/oscar-2020-tutte-le-nomination/
8: https://www.dailymood.it/2020/01/14/nomination-oscar-2020-comanda-joker-tarantino-e-scorsese-lo-tallonano/
9: https://www.cineblog.it/post/932961/bloodshot-nuovo-trailer-vin-diesel-film
10: https://www.cineblog.it/post/932933/black-widow-film-nuovo-trailer
11: https://www.davidemaggio.it/archives/181403/la-guerra-non-e-finita
12: https://www.davidemaggio.it/archives/181385/festival-di-sanremo-2020-donne-chi-sono
13: https://www.rossinavi.it/column/money/2408/

Total results found: 14
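
For readers who want to see what a date-range `tbs` value looks like, here is a minimal sketch that builds one by hand. The helper `build_tbs` is hypothetical and is not the library's `get_tbs`; the format it produces is simply copied from the `cdr:1,cd_min:1/14/2020,cd_max:1/14/2020` fragment of the URL in the question, and it is an assumption that Google treats other dates the same way.

```python
import datetime


def build_tbs(from_date, to_date):
    """Hypothetical helper: build a custom-date-range tbs value.

    Mirrors the 'cdr:1,cd_min:...,cd_max:...' fragment seen in the
    question's URL; it is not the implementation of get_tbs.
    """
    def fmt(d):
        # Month/day/year without zero padding, as in cd_min:1/14/2020
        return f"{d.month}/{d.day}/{d.year}"

    return f"cdr:1,cd_min:{fmt(from_date)},cd_max:{fmt(to_date)}"


day = datetime.date(2020, 1, 14)
print(build_tbs(day, day))  # cdr:1,cd_min:1/14/2020,cd_max:1/14/2020
```

Using the same date for both limits restricts the range to a single day, which is what the question asks for.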
Christos Lytras
  • Thank you for your answer, Christos. I got the following error: `ImportError: cannot import name 'get_tbs' from 'googlesearch' (/anaconda3/lib/python3.7/site-packages/googlesearch/__init__.py)`. Do you know why? –  Mar 20 '20 at 20:14
  • If I define `get_tbs` myself using its definition, I get the following error: `search() got an unexpected keyword argument 'country'`. I am using `Anaconda Navigator -> Jupyter (Python 3.7)`. –  Mar 20 '20 at 20:19
  • @Val are you using [`google-search`](https://pypi.org/project/google-search/) for Python 2? I am using Python 3 and the updated [`google`](https://pypi.org/project/google/) package. If you are, I suggest switching to Python 3 and the `google` package, which works almost the same and has these extra features. – Christos Lytras Mar 20 '20 at 20:42
  • I am using `Python 3` from Jupyter notebook. Do you know how to check this to be completely sure? It seems there is only one option, which is creating a new `Python 3` notebook. I ran `pip install google`, but the requirement is already satisfied. –  Mar 20 '20 at 20:50
  • Yes, you are running Python 3, so it can't be `google-search`, which is for Python 2. `google` is the right one. Can you check if you have the latest version, [`google 2.0.3`](https://pypi.org/project/google/)? You can see my code working in this Replit [PythonGoogleSearchResults](https://repl.it/@ChristosLytras/PythonGoogleSearchResults); open the link and hit the run button. – Christos Lytras Mar 20 '20 at 20:53
  • I had version 2.0.2, so I installed the latest version, google 2.0.3. However, I am still getting the same error about `country`. Do you know how I could fix it in Jupyter notebook, or could you suggest another environment? I used your code (copied and pasted). Thank you for helping me. –  Mar 20 '20 at 21:00
  • Can you run `pip3 --version` and `pip --version` and comment with the outputs here? – Christos Lytras Mar 20 '20 at 21:04
  • Please see the outputs: `!pip3 --version` -> `pip 20.0.2 from /anaconda3/lib/python3.7/site-packages/pip (python 3.7)` `!pip --version` -> `pip 20.0.2 from /anaconda3/lib/python3.7/site-packages/pip (python 3.7)` –  Mar 20 '20 at 21:07
  • It's the same. I'll check Jupyter notebook to see if I get any issues. – Christos Lytras Mar 20 '20 at 21:09
  • Did you restart Jupyter notebook after upgrading to [`google 2.0.3`](https://pypi.org/project/google/)? – Christos Lytras Mar 20 '20 at 21:11
  • No, you are right! Now it works fine. Thank you so much for all the time you spent helping me, Christos! –  Mar 20 '20 at 21:26
  • @Val glad you've got it working, and thank you for the bounty. I should have thought to tell you to restart Jupyter after updating the `google` package. Please don't forget to mark the answer as accepted, as it will help other users who run into the same issue in the future. – Christos Lytras Mar 20 '20 at 21:32
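
The version check discussed in the comments above can also be done from inside the notebook. This is a minimal sketch, assuming Python 3.8 or newer (where `importlib.metadata` is in the standard library; the Python 3.7 setup in the comments would need a different approach). The function name `installed_version` is made up for illustration.

```python
from importlib import metadata  # assumes Python 3.8 or newer


def installed_version(package_name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "google" is the PyPI distribution that provides the googlesearch module.
print(installed_version("google"))
```

This reports the version installed on disk. A notebook kernel that imported the older module before the upgrade keeps using it until the kernel is restarted, which matches what resolved the error in the comments.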