I am scraping one particular page with headless chromedriver.

The page is huge: to load it entirely I need 10k+ clicks on a lazy-load button.

The more I click, the slower things get.

Is there a way to make the process faster?

Here is the code:

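from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from tqdm import tqdm

def driver_config():
    chrome_options = Options()
    # Do not load images
    prefs = {"profile.managed_default_content_settings.images": 2}
    chrome_options.add_experimental_option("prefs", prefs)
    # Do not wait for stylesheets, images and subframes to finish loading
    chrome_options.page_load_strategy = 'eager'
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)
    return driver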

def scroll_the_category_until_the_end(driver, category_url):

    driver.get(category_url)
    
    pbar = tqdm()
    pbar.write('initializing spin')
    
    while True:
        try:
            show_more_button = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div/div[2]/div[2]/div[2]/button')))
            driver.execute_script("arguments[0].click();", show_more_button)
            pbar.update()
                
        except TimeoutException:
            pbar.write('docking')
            pbar.close()
            break

driver = driver_config()
scroll_the_category_until_the_end(driver, 'https://supl.biz/russian-federation/stroitelnyie-i-otdelochnyie-materialyi-supplierscategory9403/')

UPDATE:

I also tried another strategy, but it didn't work:

  • deleting all company information on every iteration
  • clearing the driver cache

My hypothesis was that if I did this, the DOM would always stay clean and fast.

driver = driver_config()
driver.get('https://supl.biz/russian-federation/stroitelnyie-i-otdelochnyie-materialyi-supplierscategory9403/')

pbar = tqdm()
pbar.clear()

while True:
    try:    

        for el in driver.find_elements_by_class_name('a_zvOKG8vZ'):
            driver.execute_script("""var element = arguments[0];element.parentNode.removeChild(element);""", el)
        
        button = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH,"//*[contains(text(), 'Показать больше поставщиков')]")))
        driver.execute_script("arguments[0].click();", button)
        pbar.update()
        driver.execute_script("window.localStorage.clear();")

    except Exception as e:
        pbar.close()
        print(e)
        break
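A possible refinement of the attempt above, offered only as a sketch: instead of calling execute_script once per element, the removal can be done in a single call by passing a CSS selector to the page. The names REMOVE_ALL_JS and remove_all are hypothetical helpers, not part of Selenium, and this only reduces round trips to the browser; it is not claimed to fix the slowdown.

```python
# Hypothetical helper: remove every matching element with one script call.
# The JavaScript runs in the page; arguments[0] is the CSS selector string.
REMOVE_ALL_JS = """
const nodes = document.querySelectorAll(arguments[0]);
nodes.forEach(node => node.remove());
return nodes.length;
"""


def remove_all(driver, css_selector):
    """Remove all elements matching css_selector; return how many were removed."""
    return driver.execute_script(REMOVE_ALL_JS, css_selector)


# Usage with the driver from the question (class name taken from the code above):
# removed = remove_all(driver, '.a_zvOKG8vZ')
```

The function only needs an object with an execute_script method, so it works with the driver created by driver_config().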
xxx45yb

1 Answer

First, the website uses JavaScript to fetch new data. Clicking the "more results" button triggers an HTTP GET request to an API, and the response contains the data needed to load the page with more results, including data on each product. You can view this request by inspecting the page --> Network tools --> XHR and then clicking the button.

The most efficient way to grab data from a website that loads it with JavaScript is to re-engineer the HTTP request the JavaScript is making.

In this case it's relatively easy: I copied the request as a cURL command from the XHR tab of the page inspector and converted it to Python using curl.trillworks.com.

This is the screen you get to with XHR, before clicking the "more results" button.

[screenshot]

After clicking the "more results" button you get this. Notice how a request has appeared in the list?

[screenshot]

Here I'm copying the request as cURL to grab the necessary headers and so on. You can then paste this into curl.trillworks.com, which converts the request into params, cookies and headers and gives you boilerplate for the requests package.

[screenshot]

I had a play around with the request using the requests package, trying various parts of the headers. The copied request includes cookies, but they're not actually necessary when you make the request.

The simplest request to make is one without headers, parameters or cookies, but most API endpoints don't accept this. In this case, having played around with the requests package, you need a user-agent and the parameters that specify what data you get back from the API. In fact, you don't even need a valid user-agent.
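To make the shape of that minimal request concrete, here is a small sketch that only builds it, without sending anything, using the standard library. The endpoint, parameter names and the dummy user-agent are the ones used in the coding example in this answer; build_search_request is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = 'https://supl.biz/api/monolith/suppliers-catalog/search/'


def build_search_request(category, city, page, size):
    """Build (but do not send) the GET request that the button click triggers."""
    query = urlencode({'category': category, 'city': city, 'page': page, 'size': size})
    # A user-agent header is required, but it does not have to be a valid one
    return Request(f'{API_URL}?{query}', headers={'user-agent': 'M'})


request = build_search_request('9403', 'russian-federation', 1, 8)
print(request.full_url)
# https://supl.biz/api/monolith/suppliers-catalog/search/?category=9403&city=russian-federation&page=1&size=8
```

Everything the browser sends beyond this (cookies, the other headers) can be left out.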

Now you could use a while loop to keep making HTTP requests in sizes of 8. Unfortunately, simply altering the size in the parameters of a single request doesn't get you all the data!

Coding Example

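import requests
import time

i = 8  # value sent as the 'size' parameter
j = 1  # value sent as the 'page' parameter

# Any user-agent string will do, it does not have to be a valid one
headers = {
    'user-agent': 'M'
}

while True:
    params = (
        ('category', '9403'),
        ('city', 'russian-federation'),
        ('page', f'{j}'),
        ('size', f'{i}'),
    )
    response = requests.get('https://supl.biz/api/monolith/suppliers-catalog/search/', headers=headers, params=params)
    # Stop as soon as the server answers with anything other than 200 OK
    if response.status_code != 200:
        break
    hits = response.json()['hits']
    # Stop when the API returns an empty list of results
    if not hits:
        break
    print(hits[0])
    i += 8
    j += 1
    time.sleep(4)  # pause between requests to be gentle on the server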

Output

Sample output

{'id': 1373827,
 'type': None,
 'highlighted': None,
 'count_days_on_tariff': 183,
 'tariff_info': {'title_for_show': 'Поставщик Премиум',
  'finish_date': '2021-02-13',
  'url': '/supplier-premium-membership/',
  'start_date': '2020-08-13'},
 'origin_ru': {'id': 999, 'title': 'Санкт-Петербург'},
 'title': 'ООО "СТАНДАРТ 10"',
 'address': 'Пискаревский проспект, 150, корпус 2.',
 'inn': '7802647317',
 'delivery_types': ['self', 'transportcompany', 'suppliercars', 'railway'],
 'summary': 'Сэндвич-панели: новые, 2й сорт, б/у. Холодильные камеры: новые, б/у. Двери для холодильных камер: новые, б/у. Строительство холодильных складов, ангаров и др. коммерческих объектов из сэндвич-панелей. Холодильное оборудование: новое, б/у.',
 'phone': '79219602762',
 'hash_id': 'lMJPgpEz7b',
 'payment_types': ['cache', 'noncache'],
 'logo_url': 'https://suplbiz-a.akamaihd.net/media/cache/37/9e/379e9fafdeaab4fc5a068bc90845b56b.jpg',
 'proposals_count': 4218,
 'score': 42423,
 'reviews': 0,
 'rating': '0.0',
 'performed_orders_count': 1,
 'has_replain_chat': False,
 'verification_status': 2,
 'proposals': [{'id': 20721916,
   'title': 'Сэндвич панели PIR 100',
   'description': 'Сэндвич панели. Наполнение Пенополиизлцианурат ПИР PIR. Толщина 100мм. Длина 3,2 метра. Rall9003/Rall9003. Вналичии 600м2. Количество: 1500',
   'categories': [135],
   'price': 1250.0,
   'old_price': None,
   'slug': 'sendvich-paneli-pir-100',
   'currency': 'RUB',
   'price_details': 'Цена  за шт.',
   'image': {'preview_220x136': 'https://suplbiz-a.akamaihd.net/media/cache/72/4d/724d0ba4d4a2b7d459f3ca4416e58d7d.jpg',
    'image_dominant_color': '#ffffff',
    'preview_140': 'https://suplbiz-a.akamaihd.net/media/cache/67/45/6745bb6f616b82f7cd312e27814b6b89.jpg',
    'hash': 'd41d8cd98f00b204e9800998ecf8427e'},
   'additional_images': [],
   'availability': 1,
   'views': 12,
   'seo_friendly': False,
   'user': {'id': 1373827,
    'name': 'ООО "СТАНДАРТ 10"',
    'phone': '+79219602762',
    'address': 'Пискаревский проспект, 150, корпус 2.',
    'origin_id': 999,
    'country_id': 1,
    'origin_title': 'Санкт-Петербург',
    'verified': False,
    'score': 333,
    'rating': 0.0,
    'reviews': 0,
    'tariff': {'title_for_show': 'Поставщик Премиум',
     'count_days_on_tariff': 183,
     'finish_date': '2021-02-13',
     'url': '/supplier-premium-membership/',
     'start_date': '2020-08-13'},
    'performed_orders_count': 1,
    'views': 12,
    'location': {'lon': 30.31413, 'lat': 59.93863}}},
  {'id': 20722131,
   'title': 'Сэндвич панели ппу100 б/у, 2,37 м',
   'description': 'Сэндвич панели. Наполнение Пенополиуретан ППУ ПУР PUR. Толщина 100 мм. длинна 2,37 метра. rall9003/rall9003. БУ. В наличии 250 м2.',
   'categories': [135],
   'price': 800.0,
   'old_price': None,
   'slug': 'sendvich-paneli-ppu100-b-u-2-37-m',
   'currency': 'RUB',
   'price_details': 'Цена  за шт.',
   'image': {'preview_220x136': 'https://suplbiz-a.akamaihd.net/media/cache/d1/49/d1498144bc7b324e288606b0d7d98120.jpg',
    'image_dominant_color': '#ffffff',
    'preview_140': 'https://suplbiz-a.akamaihd.net/media/cache/10/4b/104b54cb9b7ddbc6b2f0c1c5a01cdc2d.jpg',
    'hash': 'd41d8cd98f00b204e9800998ecf8427e'},
   'additional_images': [],
   'availability': 1,
   'views': 4,
   'seo_friendly': False,
   'user': {'id': 1373827,
    'name': 'ООО "СТАНДАРТ 10"',
    'phone': '+79219602762',
    'address': 'Пискаревский проспект, 150, корпус 2.',
    'origin_id': 999,
    'country_id': 1,
    'origin_title': 'Санкт-Петербург',
    'verified': False,
    'score': 333,
    'rating': 0.0,
    'reviews': 0,
    'tariff': {'title_for_show': 'Поставщик Премиум',
     'count_days_on_tariff': 183,
     'finish_date': '2021-02-13',
     'url': '/supplier-premium-membership/',
     'start_date': '2020-08-13'},
    'performed_orders_count': 1,
    'views': 4,
    'location': {'lon': 30.31413, 'lat': 59.93863}}},
  {'id': 20722293,
   'title': 'Холодильная камера polair 2.56х2.56х2.1',
   'description': 'Холодильная камера. Размер 2,56 Х 2,56 Х 2,1. Камера из сэндвич панелей ППУ80. Камера с дверью. -5/+5 или -18. В наличии. Подберем моноблок или сплит систему. …',
   'categories': [478],
   'price': 45000.0,
   'old_price': None,
   'slug': 'holodilnaya-kamera-polair-2-56h2-56h2-1',
   'currency': 'RUB',
   'price_details': 'Цена  за шт.',
   'image': {'preview_220x136': 'https://suplbiz-a.akamaihd.net/media/cache/c1/9f/c19f38cd6893a3b94cbdcbdb8493c455.jpg',
    'image_dominant_color': '#ffffff',
    'preview_140': 'https://suplbiz-a.akamaihd.net/media/cache/4d/b0/4db06a2508cccf5b2e7fe822c1b892a2.jpg',
    'hash': 'd41d8cd98f00b204e9800998ecf8427e'},
   'additional_images': [],
   'availability': 1,
   'views': 5,
   'seo_friendly': False,
   'user': {'id': 1373827,
    'name': 'ООО "СТАНДАРТ 10"',
    'phone': '+79219602762',
    'address': 'Пискаревский проспект, 150, корпус 2.',
    'origin_id': 999,
    'country_id': 1,
    'origin_title': 'Санкт-Петербург',
    'verified': False,
    'score': 333,
    'rating': 0.0,
    'reviews': 0,
    'tariff': {'title_for_show': 'Поставщик Премиум',
     'count_days_on_tariff': 183,
     'finish_date': '2021-02-13',
     'url': '/supplier-premium-membership/',
     'start_date': '2020-08-13'},
    'performed_orders_count': 1,
    'views': 5,
    'location': {'lon': 30.31413, 'lat': 59.93863}}},
  {'id': 20722112,
   'title': 'Сэндвич панели ппу 80 б/у, 2,4 м',
   'description': 'Сэндвич панели. Наполнение ППУ. Толщина 80 мм. длинна 2,4 метра. БУ. В наличии 350 м2.',
   'categories': [135],
   'price': 799.0,
   'old_price': None,
   'slug': 'sendvich-paneli-ppu-80-b-u-2-4-m',
   'currency': 'RUB',
   'price_details': 'Цена  за шт.',
   'image': {'preview_220x136': 'https://suplbiz-a.akamaihd.net/media/cache/ba/06/ba069a73eda4641030ad69633d79675d.jpg',
    'image_dominant_color': '#ffffff',
    'preview_140': 'https://suplbiz-a.akamaihd.net/media/cache/4f/e9/4fe9f3f358f775fa828c532a6c08e7f2.jpg',
    'hash': 'd41d8cd98f00b204e9800998ecf8427e'},
   'additional_images': [],
   'availability': 1,
   'views': 8,
   'seo_friendly': False,
   'user': {'id': 1373827,
    'name': 'ООО "СТАНДАРТ 10"',
    'phone': '+79219602762',
    'address': 'Пискаревский проспект, 150, корпус 2.',
    'origin_id': 999,
    'country_id': 1,
    'origin_title': 'Санкт-Петербург',
    'verified': False,
    'score': 333,
    'rating': 0.0,
    'reviews': 0,
    'tariff': {'title_for_show': 'Поставщик Премиум',
     'count_days_on_tariff': 183,
     'finish_date': '2021-02-13',
     'url': '/supplier-premium-membership/',
     'start_date': '2020-08-13'},
    'performed_orders_count': 1,
    'views': 8,
    'location': {'lon': 30.31413, 'lat': 59.93863}}},
  {'id': 20722117,
   'title': 'Сэндвич панели ппу 60 мм, 2,99 м',
   'description': 'Сэндвич панели. Наполнение Пенополиуретан ППУ ПУР PUR . Новые. В наличии 600 м2. Толщина 60 мм. длинна 2,99 метров. rall9003/rall9003',
   'categories': [135],
   'price': 1100.0,
   'old_price': None,
   'slug': 'sendvich-paneli-ppu-60-mm-2-99-m',
   'currency': 'RUB',
   'price_details': 'Цена  за шт.',
   'image': {'preview_220x136': 'https://suplbiz-a.akamaihd.net/media/cache/e2/fb/e2fb6505a5af74a5a994783a5e51600c.jpg',
    'image_dominant_color': '#ffffff',
    'preview_140': 'https://suplbiz-a.akamaihd.net/media/cache/9c/f5/9cf5905a26e6b2ea1fc16d50c19ef488.jpg',
    'hash': 'd41d8cd98f00b204e9800998ecf8427e'},
   'additional_images': [],
   'availability': 1,
   'views': 10,
   'seo_friendly': False,
   'user': {'id': 1373827,
    'name': 'ООО "СТАНДАРТ 10"',
    'phone': '+79219602762',
    'address': 'Пискаревский проспект, 150, корпус 2.',
    'origin_id': 999,
    'country_id': 1,
    'origin_title': 'Санкт-Петербург',
    'verified': False,
    'score': 333,
    'rating': 0.0,
    'reviews': 0,
    'tariff': {'title_for_show': 'Поставщик Премиум',
     'count_days_on_tariff': 183,
     'finish_date': '2021-02-13',
     'url': '/supplier-premium-membership/',
     'start_date': '2020-08-13'},
    'performed_orders_count': 1,
    'views': 10,
    'location': {'lon': 30.31413, 'lat': 59.93863}}}]}

Explanation

Here we make sure the response status is 200 before making another request. Using f-strings, each iteration of the while loop increases the page by 1 and the size of the results in the JSON object by 8. I've imposed a delay between requests because if you push too many HTTP requests at once you'll end up getting IP banned. Be gentle on the server!

The response.json() method converts the JSON object to a Python dictionary. You haven't specified what data you need, but if you can handle a Python dictionary you can grab the data you require.
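As an illustration of handling that dictionary, here is a minimal sketch that pulls a few fields out of each entry in 'hits'. The field names come from the sample output above; which fields you actually want is an assumption, and summarize_hits is a hypothetical helper name.

```python
def summarize_hits(payload):
    """Return one small dict per supplier found in an API response payload."""
    rows = []
    for hit in payload.get('hits', []):
        rows.append({
            'title': hit.get('title'),
            'inn': hit.get('inn'),
            'phone': hit.get('phone'),
            # 'origin_ru' is a nested dict; tolerate it being missing or None
            'city': (hit.get('origin_ru') or {}).get('title'),
            'proposal_titles': [p.get('title') for p in hit.get('proposals', [])],
        })
    return rows


# Trimmed-down payload shaped like the sample output above
sample = {
    'hits': [{
        'title': 'ООО "СТАНДАРТ 10"',
        'inn': '7802647317',
        'phone': '79219602762',
        'origin_ru': {'id': 999, 'title': 'Санкт-Петербург'},
        'proposals': [{'title': 'Сэндвич панели PIR 100'}],
    }]
}
rows = summarize_hits(sample)
print(rows[0]['title'], rows[0]['city'])
```

In the loop above you would call summarize_hits(response.json()) instead of printing the first hit.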

Comments

Here is where the parameters come from. You can see the page and size values here.

[screenshot]

AaronS
  • I've added some pictures to show the step through process. As I said in the original post, I copied the request made of the server to load more results as a cURL command. Using curl.trillworks I was able to convert this into python friendly request with all the necessary params, cookies etc... By just making several requests with different things, I started without the cookies, I was still able to make the correct request, I then altered the headers, and found out I only needed to use user-agent to get the right request. The params variable was always going to be necessary. – AaronS Aug 20 '20 at 07:40
  • Have you followed the steps exactly as I describe in the pictures? The request appears on the screen when you click the more results button. That's what it means when JavaScript is making an HTTP request: it only appears when the button is clicked. – AaronS Aug 20 '20 at 07:51
  • The parameters are there when you see the request you can click it and a right-hand box appears with all the data you require. I'm just a bit lazy I don't want to have to create a python dictionary by copying and pasting several times. – AaronS Aug 20 '20 at 08:37
  • I've updated the answer with another picture to explain where you get it. When you're scrolling down the headers column at the bottom is usually where the query set parameters are, this is what you have to pass to the request to get the information you want. But as you can see, you have to create that dictionary yourself. – AaronS Aug 20 '20 at 09:16
  • By the way, what should I do if, after a certain number of requests, response.json()['hits'] returns empty lists? That should not be the case: for the URL https://supl.biz/russian-federation/iskusstvo-i-kultura-supplierscategory7696/ (where 7696 is the number in the category parameter) there are at least 1000 companies, yet after the 50th iteration (50*8 companies = 400 companies) only empty lists appear. – xxx45yb Aug 20 '20 at 11:14
  • I'm not sure there's much you can do; it's up to the API endpoint how much data it displays. There may be a restriction on how many companies it will display, even though you'd find a company that isn't displayed if you searched for it. Amazon does the same: it only shows a portion of the library, but if you search for something you'll find it. You may have to resort to Selenium in that case if you think you can get the extra data that way. It isn't efficient and should be a last resort. – AaronS Aug 20 '20 at 15:45
  • I see, I updated the original question with another approach I tried to implement. What do you think about this? – xxx45yb Aug 20 '20 at 16:22
  • I think you'll find it relatively difficult. The reason is that Selenium makes synchronous requests: one request, one response. It's hard to do that in parallel. That being said, https://stackoverflow.com/questions/53779112/how-to-speed-up-java-selenium-script-with-minimum-wait-time might help. Selenium is essentially driving browser activity, so it will only ever be as fast as a browser would be. – AaronS Aug 20 '20 at 18:40