
I have a class I use for scraping a site. It uses the requests library sessions and looks something like this:

class Scraper:
    def scrape_page(self):
        """Scrapes the current page"""

        # do something
        self.request_next_page()

    def request_next_page(self):
        """ Finds the 'next' link if available and requests next page"""

I want to create a method in the class that allows a parameter that can either scrape n number of pages or all the pages until there is no next page. The above methods work fine.

However, I don't know of a way to let the parameter be either an integer or just a simple True for all. I'm trying to think of the best way to do this.

I want something similar to this:

    def scrape_pages(self, num):
        """Scrape n number of pages"""

It could then be run like this:

>>>  s = Scraper()
>>>  s.scrape_pages(5)       # scrape the first 5 pages.

or

>>>  s = Scraper()
>>>  s.scrape_pages(all)     # where all can be True, or anything else that works. I'm not sure.

I know I could have two separate functions, or an if statement that checks whether the argument is True or an integer and then runs a different loop depending on the case (maybe a for loop for an integer, and a while loop for anything else). Is there a better way to do this?

I noticed that the .split() method does something similar: maxsplit can either impose a limit or not. However, I am not familiar enough with C to understand how that was accomplished.

Mark Rotteveel
  • You can use 'wrong' value like 0 as a flag for 'all' - if n=0, then it means all the pages, else - fetch only n pages. – TDG Nov 05 '22 at 09:01
  • Use a default argument: `def scrape_pages(self, num=-1)`. You would then simply call `s.scrape_pages()` to get *all* pages. The implementation of the method would then be: `while num != 0: num -= 1; self.request_next_page()`. Using "magic" values like `"all"` to get different behaviour is sometimes considered unpythonic. – ekhumoro Nov 05 '22 at 12:36

1 Answer

This functionality can be implemented using either a while-loop or recursion. A while-loop is usually more efficient and easier to read. Recursion also raises a RecursionError once the number of nested calls exceeds Python's recursion limit, which matters when the number of pages is not known in advance.
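To illustrate the recursion concern, here is a minimal sketch. The two functions are hypothetical stand-ins that only count pages rather than scrape them; the point is that one nested call per page eventually hits the interpreter's recursion limit, while a loop does not.

```python
import sys


def count_pages_recursive(remaining):
    """Hypothetical recursive walk: one nested call per page."""
    if remaining == 0:
        return 0
    return 1 + count_pages_recursive(remaining - 1)


def count_pages_loop(remaining):
    """The same walk as a while-loop: the call stack does not grow."""
    count = 0
    while remaining > 0:
        count += 1
        remaining -= 1
    return count


# More "pages" than the interpreter allows nested calls.
too_many = sys.getrecursionlimit() + 100

try:
    count_pages_recursive(too_many)
    recursion_failed = False
except RecursionError:
    recursion_failed = True

loop_result = count_pages_loop(too_many)
```

Under these assumptions the recursive version fails with a RecursionError while the loop version finishes normally.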

For the main loop in scrape_pages I would do something like this:

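def scrape_pages(self, num):
    """Scrape up to n pages.

    Parameters:
        num: integer or 'all' (fetch all pages)
    """
    if num == 'all':
        num = float('inf')  # inf - 1 is still inf, so the counter never runs out
    elif not isinstance(num, int):
        raise TypeError("num must be an integer or 'all'")

    final_results = []  # assumption: collect one entry per scraped page

    while num > 0:
        page = self.scrape_page()  # assumes scrape_page() returns the page data
        num -= 1
        if not page:  # empty or None: there is no further page
            break
        # store results
        final_results.append(page)

    return final_results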

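PS. We need to pass 'all' as a string (or some other sentinel value), because the bare name all refers to Python's built-in all() function. It is not a reserved keyword, so s.scrape_pages(all) would run, but it would pass the function object itself instead of a value meaning "everything": https://www.w3schools.com/python/ref_func_all.asp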
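A quick way to check what the bare name all actually is, using only the standard library:

```python
import keyword

# 'all' is not a reserved keyword ...
is_reserved = keyword.iskeyword("all")

# ... it is an ordinary built-in function object.
is_builtin_callable = callable(all)

# Passing the bare name hands over the function, which never equals the string.
matches_string = (all == "all")
```

Here is_reserved and matches_string are False and is_builtin_callable is True, which is why the string 'all' (or another explicit sentinel) is the safer argument.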


TL;DR:

  1. Use num = float('inf') to fetch all possible pages.
  2. Break the loop when you detect that no more pages are found.
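The two points above can be tried offline with a small sketch. FakeScraper is a hypothetical stand-in for the real class: it serves pages from a list instead of making HTTP requests, and it assumes scrape_page() returns None when there is no next page. Making float('inf') the default argument is a variation, so that calling scrape_pages() with no argument means "all pages".

```python
class FakeScraper:
    """Hypothetical stand-in for Scraper: pages come from a list, not the network."""

    def __init__(self, pages):
        self._pages = list(pages)
        self._index = 0

    def scrape_page(self):
        """Return the current page and advance, or None when no page is left."""
        if self._index >= len(self._pages):
            return None
        page = self._pages[self._index]
        self._index += 1
        return page

    def scrape_pages(self, num=float("inf")):
        """Scrape up to num pages; the default of infinity means all pages."""
        results = []
        while num > 0:
            page = self.scrape_page()
            if page is None:  # no next page: stop early
                break
            results.append(page)
            num -= 1  # inf - 1 is still inf, so "all" never counts down
        return results


first_two = FakeScraper(["p1", "p2", "p3"]).scrape_pages(2)
everything = FakeScraper(["p1", "p2", "p3"]).scrape_pages()
```

With three fake pages, first_two holds the first two pages and everything holds all three; asking for more pages than exist simply stops at the last one.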
C-3PO