
My current Python script scrapes 2 pages of the website in about one second. I want it to go slower, say 25 seconds per page. How do I do that?

I tried the following Python script:

# Dependencies
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Testing
linked = 'https://www.zillow.com/homes/for_sale/San-Francisco-CA/fsba,fsbo,fore,new_lt/house_type/20330_rid/globalrelevanceex_sort/37.859675,-122.285557,37.690612,-122.580815_rect/11_zm/{}_p/0_mmm/'
for link in [linked.format(page) for page in range(1,2)]:
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
    headers = {'User-Agent': user_agent}
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

What should I add to my script to make the web scraping go slower?

hongkongbboy
    Possible duplicate of [How can I make a time delay in Python?](https://stackoverflow.com/questions/510348/how-can-i-make-a-time-delay-in-python) – Andrew Allen Apr 17 '19 at 10:54
  • Why would you want a slower scraper? – QHarr Apr 17 '19 at 12:01
  • Some websites will treat you as a robot and block you if you scrape them too fast. I have encountered a scenario where I received a message saying the website detected that I was traveling through it at superhuman speed and believed I was a bot. – hongkongbboy Apr 17 '19 at 19:09

1 Answer


Just use time.sleep:

import requests
import pandas as pd

from time import sleep
from bs4 import BeautifulSoup

linked = 'https://www.zillow.com/homes/for_sale/San-Francisco-CA/fsba,fsbo,fore,new_lt/house_type/20330_rid/globalrelevanceex_sort/37.859675,-122.285557,37.690612,-122.580815_rect/11_zm/{}_p/0_mmm/'

for link in [linked.format(page) for page in range(1,2)]:
    sleep(25.0)  # pause 25 seconds before fetching each page
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
    headers = {'User-Agent': user_agent}
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
gmds
  • Thank you so much for the help. I have a question regarding the use of the time.sleep function. Based on its name, it seems the script will take a 25-second break after scraping each page. It's like a person taking a 25-second break after running each mile. Instead, is there a way to make that person spend 25 seconds walking each mile? The goal of my school project is to build a machine learning model, so I need to scrape the same website monthly for a few months. I don't want them to block me, and one of the ways I've heard of is to crawl gently. Thanks! – hongkongbboy Apr 17 '19 at 19:21
  • I don't really understand what you mean. Are you saying that you want to take 25 seconds to load the page? If so, that's not really what websites look for. There are other things, but accessing it once every 25 seconds will normally be sufficient. – gmds Apr 17 '19 at 22:07
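Picking up on that last exchange: if the goal is just to avoid hammering the site rather than to slow the download itself, a common approach is to randomize the pause between page requests. A minimal sketch along those lines, where the 20-30 second jitter range, the two-page range, and the use of soup.title are illustrative assumptions rather than anything from the answer above:

import random
from time import sleep

import requests
from bs4 import BeautifulSoup

linked = 'https://www.zillow.com/homes/for_sale/San-Francisco-CA/fsba,fsbo,fore,new_lt/house_type/20330_rid/globalrelevanceex_sort/37.859675,-122.285557,37.690612,-122.580815_rect/11_zm/{}_p/0_mmm/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}

for page in range(1, 3):
    response = requests.get(linked.format(page), headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.title)  # do the per-page parsing here
    sleep(random.uniform(20, 30))  # wait 20-30 seconds before the next request

Sleeping at the end of each iteration spaces the requests out, and random.uniform keeps the interval from being perfectly regular, which reads a little less like an automated crawler.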