I'm scraping this website using Python and Selenium, but it currently only scrapes the first 10 pages for the month of July. The code converts the page number in the previous sibling of the "next" button to an int and clicks "next" number_of_pages - 1 times; however, after it gets to page 10 it stops.

URL - https://planning.adur-worthing.gov.uk/online-applications/search.do?action=monthlyList

Can anyone help me to get it to scrape all the pages?

def pagination( driver ):
    data = []
    last_element = driver.find_element_by_xpath('//a[ contains( concat( " ", normalize-space( @class ), " "), " next ") ]/preceding-sibling::a[1]')
    if last_element is None:
        number_of_pages = 1
    else:
        number_of_pages = int( last_element.text )
    # data = [ getData( driver ) ]
    data.extend(getData(driver))
    for i in range(number_of_pages - 1):
        driver.find_element_by_xpath('//a[ contains( concat( " ", normalize-space( @class ), " "), " next ") ]').click()
        data.extend( getData( driver ) )
        time.sleep(1)
    return data
Abdul Jamac
  • can you print number_of_pages before the for loop? I suspect that because you convert the text of the last element to int, it just shows 10 (even though there are more pages) – NotSoShabby Aug 23 '18 at 14:11
  • I just tested this out; you're right, it only turns 10 into an int and doesn't carry on for the other pages – Abdul Jamac Aug 23 '18 at 14:13
  • As per your given link [URL - https://planning.adur-worthing.gov.uk/online-applications/search.do?action=monthlyList], I am seeing only 10 pages. – Vardhman Patil Aug 23 '18 at 14:18
  • Are you checking the month of July? If you are, press page 10 and more should come up – Abdul Jamac Aug 23 '18 at 14:23

3 Answers

number_of_pages seems to have the value of 10.

Find another way to find out how many pages there are.

You can use a while loop that checks if the "next page" button is available, and if it is, keep going, else- that is the last page.

like this:

while next_button_element.is_displayed():
    # Do the action that is currently in the for loop
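
One way to realize this loop, as a minimal sketch: look the "next" link up again on every page with `find_elements`, which returns an empty list rather than raising when nothing matches. The `a.next` selector comes from the comments on this question; `scrape_all_pages`, the `get_data` callback, and the `max_pages` guard are hypothetical names for illustration, and the driver is assumed to expose a `find_elements(by, value)` call.

```python
import time


def scrape_all_pages(driver, get_data, delay=1.0, max_pages=1000):
    """Collect rows from every results page by following the "next" link.

    get_data is a hypothetical callback that scrapes the current page and
    returns a list; max_pages is only a safety guard against endless loops.
    """
    data = []
    for _ in range(max_pages):
        data.extend(get_data(driver))
        # "css selector" is the string value of selenium's By.CSS_SELECTOR.
        # Re-locate the link on every page: the old element goes stale
        # once the click loads a new page.
        next_links = driver.find_elements("css selector", "a.next")
        if not next_links:
            break  # no "next" link, so this was the last page
        next_links[0].click()
        time.sleep(delay)  # crude wait for the next page to load
    return data
```

With the helper from the question, this could be called as `data = scrape_all_pages(driver, getData)`.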
NotSoShabby
  • Do you mean like this: next_button_element = driver.find_element_by_xpath('//a[ contains( concat( " ", normalize-space( @class ), " "), " next ") ]') while next_button_element.is_displayed(): driver.find_element_by_xpath('//a[ contains( concat( " ", normalize-space( class ), " "), " next ") ]').click() data.extend( getData( driver ) ) time.sleep(1) return data – Abdul Jamac Aug 23 '18 at 14:29
  • Use more simple selectors: next button css selector `driver.find_elements_by_css_selector('a.next')` – Sers Aug 23 '18 at 14:31
  • No need to find the element twice. find it once and store it in a variable and then use is_displayed() or click() function on it – NotSoShabby Aug 23 '18 at 14:32
  • This doesnt work either next_button_element = driver.find_elements_by_css_selector('a.next') while next_button_element.is_displayed(): next_button_element.click() data.extend( getData( driver ) ) time.sleep(1) return data – Abdul Jamac Aug 23 '18 at 14:40

Code you can use:

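from selenium.common.exceptions import NoSuchElementException

while True:
    data.extend(getData(driver))
    try:
        driver.find_element_by_css_selector('a.next').click()
    except NoSuchElementException:
        # No "next" link on the last page, so stop
        break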
Sers
  • got this error next_button_by = (By.CSS_SELECTOR, "a.next") NameError: global name 'By' is not defined – Abdul Jamac Aug 23 '18 at 14:54
  • add `from selenium.webdriver.common.by import By` – Sers Aug 23 '18 at 14:59
  • if driver.find_elements(next_button_by)==0: this line gave the error: WebDriverException: Message: invalid argument: 'using' must be a string – Abdul Jamac Aug 23 '18 at 15:02
  • Missed get count `.count` or use `len(driver.find_elements(next_button_by))` – Sers Aug 23 '18 at 15:06
  • It works, thank you. However, how do I get it to stop printing this error when a.next doesn't exist? NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"a.next"} – Abdul Jamac Aug 23 '18 at 15:57
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/178627/discussion-between-sers-and-abdul-jamac). – Sers Aug 23 '18 at 16:03

Look, I understand you took the idea of calculating the total number of pages from my answer for a previous question of yours. In the previous case since the last page number was directly available to us, it worked but that's not the case here.

Solution :

Although the number of pages is not directly available, the total number of entries is:

Image displaying the total number of entries

Now, as you can see in the above screenshot, for July this number is 174. Assuming you leave the page size (the number of entries on a single page) at the default of 10, the number of pages should be 18 (17 pages of 10 entries each and one extra page for the remaining 4 entries).

So, the logic of calculating the number of pages is simple. If you have the total number of entries in a total_entries variable, the number of pages is:

number_of_pages = (total_entries + 9) // 10

This is ceiling division: the // operator returns the floor of the quotient, so adding 9 first rounds any partial page up. For 174 entries, (174 + 9) // 10 is 18. A plain total_entries / 10 + 1 would give a float in Python 3 and would count one page too many whenever the total is an exact multiple of 10.
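
The arithmetic can be checked with a small helper. This is only a sketch: `page_count` and `per_page` are hypothetical names, and it uses integer ceiling division so that the result is an int on both Python 2 and Python 3 and an exact multiple of the page size does not yield an extra page.

```python
def page_count(total_entries, per_page=10):
    """Return how many pages are needed to show total_entries rows."""
    if total_entries <= 0:
        return 1  # assumption: an empty result list still renders one page
    # Ceiling division with integers only: add per_page - 1, then floor.
    return (total_entries + per_page - 1) // per_page


print(page_count(174))  # 18: 17 full pages plus one page with 4 entries
print(page_count(170))  # 17: an exact multiple needs no extra page
```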

Now, to extract the total number of entries, use the locator below to find the <span> element holding it.

driver.find_element_by_xpath("//span[@class='showing']")

But this element contains text like this - Showing 1-10 of 174. You need only the 174 part from the entire string. To do that, first you extract the string after "of" and then convert it into int.

Algorithm to extract the total number of entries as int from the text:

import re

showing_text = driver.find_element_by_xpath("//span[@class='showing']").text    # "Showing 1-10 of 174"
number_of_entries_text = showing_text.split("of",1)[1]        # " 174" as text
number_of_entries = int( re.findall(r'\d+',number_of_entries_text)[0])  # 174 as int
number_of_pages = (number_of_entries + 9) // 10   # 18 (ceiling division)
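
The text handling can also be tried without a browser. The sketch below assumes the summary always ends in "of <number>", as in the "Showing 1-10 of 174" example; `parse_total_entries` is a hypothetical helper name, and it uses a single regular expression in place of the split-then-findall steps.

```python
import re


def parse_total_entries(showing_text):
    """Extract the total from text such as "Showing 1-10 of 174"."""
    # Anchor on the trailing "of <number>" so the "1-10" range is ignored.
    match = re.search(r"of\s+(\d+)\s*$", showing_text)
    if match is None:
        raise ValueError("unexpected summary text: %r" % showing_text)
    return int(match.group(1))


print(parse_total_entries("Showing 1-10 of 174"))  # 174
```

Raising on unexpected text makes a layout change on the site fail loudly instead of silently producing a wrong page count.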

Final code:

import re
import time

def pagination( driver ):
    data = []
    # find_elements_* returns an empty list instead of raising when nothing matches
    showing_elements = driver.find_elements_by_xpath("//span[@class='showing']")
    if not showing_elements:
        number_of_pages = 1
    else:
        showing_text = showing_elements[0].text
        number_of_entries_text = showing_text.split("of",1)[1]
        number_of_entries = int( re.findall(r'\d+',number_of_entries_text)[0])
        number_of_pages = (number_of_entries + 9) // 10

    data.extend(getData(driver))    # scrape the first page
    for i in range(number_of_pages - 1):
        driver.find_element_by_xpath('//a[ contains( concat( " ", normalize-space( @class ), " "), " next ") ]').click()
        time.sleep(1)               # give the next page time to load
        data.extend(getData(driver))
    return data

Note:

I think my solution is better since you don't have to repeatedly check for any element to be available or to catch any exceptions. You just directly get the number of pages and you click the next button that many times.

Shivam Mishra
  • math.ceil rounds it down to the smallest integer so that means it would skip page 18 – Abdul Jamac Aug 23 '18 at 16:53
  • if there is a way to get it to also go to page 18 that would be great – Abdul Jamac Aug 23 '18 at 17:12
  • @AbdulJamac I am sorry I made it more complicated than it was necessary. Python by default returns lower bound int on division operator so there is no need of math.ceil. Check my edited answer. Just dividing the total number of entries by 10 and adding 1 to that will do the trick. And yes that way, it will go all the way to the end i.e. 18. – Shivam Mishra Aug 23 '18 at 17:18