I am currently using Python to scrape a site with thousands of pages. It works fine, but going through all the pages takes a couple of hours, partly because I keep a short delay between pages, which I believe is fair to the provider of the site. However, the real site has a dropdown menu with an option to display more results per page. In the HTML it looks like this:

<div class="page-sizer">
    <select id="itemsPerPage" class="form-control input-sm">
            <option value="10" selected>10</option>
            <option value="50" >50</option>
            <option value="200" >200</option>
    </select>
</div>
<script>
    $(document).on('bb:ready', function () {
        var pageSizeOptions = {
            setPageSizeUrl: '/Pager/SetPageSize'
        };

        ScrapeThisWebsite.PageSize.init(pageSizeOptions);
    });
</script>

Is there any way to automatically display 200 results per page instead of only 10, saving time for both the provider and me? The selection does not appear in the URL, so if I copy the page link into another browser, it reverts to the default.

I'm going through the pages using the following simple steps:

import requests

myheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36', "Upgrade-Insecure-Requests": "1","DNT": "1","Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Language": "en-US,en;q=0.5","Accept-Encoding": "gzip, deflate"}

page = requests.get(url, headers=myheaders)

Is it linked to how the page is loaded?

Sweta Jain
Jmark

1 Answer

You can use the selenium library to interact with the dropdown. It may also be worth checking whether the site has an API from which you could fetch the data directly. To check, inspect the page, open the Network tab, and filter by Fetch/XHR; if an API shows up there, you can fetch the data with the requests library.
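If the Network tab shows a request to the `/Pager/SetPageSize` URL from the question's inline script when you change the dropdown, the requests route might look roughly like the sketch below. The HTTP method (POST), the form field name `pageSize`, and the helper name `fetch_with_page_size` are assumptions made for illustration, so copy the real method and payload from the request you see in the browser. The idea is that the page size is probably stored server-side per session, so the same session (and its cookies) has to be reused for the page requests.

```python
from urllib.parse import urljoin

# Path taken from the inline script in the question.
SET_PAGE_SIZE_PATH = "/Pager/SetPageSize"


def fetch_with_page_size(session, base_url, page_path, size, field="pageSize"):
    """Set the page size for this session, then fetch one listing page.

    `field` is a guessed form field name (hypothetical); replace it with the
    one shown in the browser's Network tab.
    """
    # Tell the server which page size this session wants.
    resp = session.post(urljoin(base_url, SET_PAGE_SIZE_PATH), data={field: size})
    resp.raise_for_status()

    # Reuse the same session so its cookies carry the setting along.
    page = session.get(urljoin(base_url, page_path))
    page.raise_for_status()
    return page.text


def main():
    # Third-party dependency: pip install requests
    import requests

    with requests.Session() as session:
        session.headers.update({"User-Agent": "Mozilla/5.0"})
        # Placeholder URL and path; substitute the real site and listing page.
        html = fetch_with_page_size(session, "https://yourwebsite.com", "/", 200)
        print(len(html))
```

Calling `main()` runs it against the placeholder URL. If the response still lists 10 items, the assumed method or field name is likely wrong for this site.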

Here is how to select the value in the dropdown using selenium. More on select in the docs.

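from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

# Path to your ChromeDriver binary (Selenium 4 takes it via a Service object)
driver = webdriver.Chrome(service=Service('/Users/username/chromedriver'))

# Load the website
driver.get('https://yourwebsite.com')

# Locate the <select> element by its ID and wrap it in Select
dropdown = Select(driver.find_element(By.ID, 'itemsPerPage'))

# Choose the option whose value is "200"
dropdown.select_by_value('200')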

Cassiopea
  • Thank you for the insight. Just a few lines of feedback: I kept getting a number of warnings using your code, so with the help of https://stackoverflow.com/questions/64717302/deprecationwarning-executable-path-has-been-deprecated-selenium-python I removed the warning from webdriver.Chrome. Apparently they are also changing driver.find_element_by_id('itemsPerPage') to driver.find_element(By.ID, 'itemsPerPage'). Anyway, it is a very different way of interacting with sites, and it gives me the option to just "click" Next instead of calculating the number of pages and links – Jmark Oct 18 '21 at 12:43
  • @Jmark, thank you for letting me know, I'll look into that too! – Cassiopea Oct 18 '21 at 12:46