
I have a script to scrape products from eBay and other sites. It works for the first 4 pages, but past the 5th page I can't find anything (yes, the item I'm searching for has more than 5 pages of results).

I've tried the solution suggested here, but it doesn't work: if I add a timeout I get `ReadTimeout: HTTPSConnectionPool(host='', port=443): Read timed out. (read timeout=2)`, and if I don't add a timeout the request just hangs forever.

ex:

import requests
from bs4 import BeautifulSoup

search_term = 'gtx+1050ti'
page_num = 1
while True:
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    page = requests.get(f'https://www.ebay.com/sch/i.html?_nkw={search_term}&_fcid=164&_sop=15&_pgn={page_num}', headers=headers)
    soup = BeautifulSoup(page.text, 'lxml')

    products = soup.find_all('li', class_='s-item')
    print(products)

    if not products:
        break
    page_num = page_num + 1

When `page_num > 4`, `products` is an empty list (`[]`).

André Clérigo
    I was able to run the code for 200 pages for the given search term by just changing `headers = {'user-agent': 'Mozilla/5.0'}` – Kamalesh S Sep 21 '21 at 14:31

1 Answer


Setting a timeout (for example, 120 seconds) will possibly solve this. If the site detects that the requests are made by a bot, it can hang the connection forever instead of responding, until the request eventually throws an error.
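As a side note, if you do set a timeout, you can also catch the resulting exception and retry instead of letting the script crash. A minimal sketch of such a retry helper (the function names `backoff_delays` and `get_with_retries` are hypothetical, not part of `requests`):

```python
import time
import requests

def backoff_delays(retries, base=1.0):
    # Exponential backoff schedule in seconds: base, 2*base, 4*base, ...
    return [base * (2 ** i) for i in range(retries)]

def get_with_retries(url, headers=None, retries=3, timeout=120):
    # Retry a GET on timeout, sleeping a bit longer between each attempt.
    for delay in backoff_delays(retries):
        try:
            return requests.get(url, headers=headers, timeout=timeout)
        except requests.exceptions.Timeout:
            time.sleep(delay)
    raise requests.exceptions.Timeout(f"no response from {url} after {retries} attempts")
```

This way a single slow response doesn't kill the whole scrape; only a URL that times out on every attempt raises.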

The other issue is that you were using a very old user-agent. Check what your current browser's user-agent is and pass that in the request: websites inspect the user-agent version, and if it's outdated, the request most likely came from a bot.
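To see why the header matters at all, note that `requests` identifies itself very plainly when you don't override it, which is trivial for a site to flag:

```python
import requests

# The default user-agent requests sends when none is supplied in headers,
# e.g. "python-requests/2.28.1" -- an obvious bot signature.
print(requests.utils.default_user_agent())
```

Passing a recent browser user-agent string in `headers` replaces this default.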

To collect the information you need, checking a `products` variable that only looks for listings is not reliable: even on the last, empty page with no more results, that selector is still present, so the loop never ends.

To get around this, we need a selector that disappears when no more listings are left, which in this case is `.pagination__next`. It vanishes when there are no more pages of listings, and that is the signal to exit the while loop.
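The sentinel check can be sketched in isolation like this (the HTML snippets here are hypothetical stand-ins for eBay's markup, and the stdlib `html.parser` is used so the snippet runs without `lxml`):

```python
from bs4 import BeautifulSoup

# Hypothetical page with a "next" link: the loop should continue.
page_with_next = '<nav><a class="pagination__next" href="#">Next</a></nav>'
# Hypothetical last page: listings may still render, but no "next" link exists.
last_page = '<nav><span class="pagination__disabled">Next</span></nav>'

def has_next_page(html):
    # select_one returns None when the selector matches nothing.
    return BeautifulSoup(html, "html.parser").select_one(".pagination__next") is not None

print(has_next_page(page_with_next))  # True
print(has_next_page(last_page))       # False
```

The full scraper below uses exactly this `select_one(".pagination__next")` test to decide between advancing `page_num` and breaking out.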



from bs4 import BeautifulSoup
import requests, lxml, json

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}

data = []
page_num = 1
search_term = 'gtx+1050ti'

while True:
    page = requests.get(f'https://www.ebay.com/sch/i.html?_nkw={search_term}&_fcid=164&_sop=15&_pgn={page_num}', headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')
    print(f"Extracting page: {page_num}")
    print("-" * 10)

    for product in soup.select(".s-item__info"):
        title = product.select_one(".s-item__title span").text
        price = product.select_one(".s-item__price").text

        data.append({
            "title": title,
            "price": price
        })

    # exit the loop once the "next page" button disappears
    if soup.select_one(".pagination__next"):
        page_num += 1
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))

Example output

Extracting page: 1
----------
[
  {
    "title": "FAN & SCREWS FOR EVGA GeForce GTX 1050 TI SC Gaming Graphics Card 04G-P4-6253-KR",
    "price": "$15.00"
  },
  {
    "title": "GTX1050TI Desktop Video Card Stable Output DDR5 High Performance Gaming Graphics",
    "price": "$68.25"
  },
  {
    "title": "GTX1050TI Graphics Card 4GB Low Noise Sturdy Reliable for Computer",
    "price": "$69.04"
  },
  {
    "title": "GTX1050TI Gaming Graphics Card Powerful Low Noise High Clarity Discrete Gaming",
    "price": "$71.84"
  },
  {
    "title": "GTX1050TI Gaming Graphics Card Powerful Low Noise High Clarity Discrete Gaming",
    "price": "$71.84"
  },
  # ...
]
Denis Skopa