
I have a project where I'm scraping data from Trulia.com, and I want to get the maximum page number (the last number in the pagination bar) for a specific location (screenshot below) so I can loop through all the pages and collect all the hrefs.

[Screenshot: the Trulia pagination bar for the location, where the last number is the max page]

To get that last number, I wrote the code below. It runs without errors and should return an integer, but it doesn't always return the same number. I added the print of the list comprehension to understand what's going wrong. The code and its output are below. The return is commented out, but it should return the last number of the printed list as an int.

import requests as r
from bs4 import BeautifulSoup as bs

req_headers = {"User-Agent": "Mozilla/5.0"}  # placeholder - the actual headers dict is defined elsewhere in my script
city_link = "https://www.trulia.com/for_rent/San_Francisco,CA/"

def bsoup(url):
    # fetch the page and parse it with BeautifulSoup
    resp = r.get(url, headers=req_headers)
    soup = bs(resp.content, 'html.parser')
    return soup

def max_page(link):
    soup = bsoup(link)
    # every pagination link carries data-testid="pagination-page-link"
    page_num = soup.find_all(attrs={"data-testid": "pagination-page-link"})
    print([x.get_text() for x in page_num])
#     return int(page_num[-1].get_text())

for x in range(10):
    max_page(city_link)

[Screenshot: printed output of the 10 runs, where the list of page numbers is not always the same]

I have no clue why it sometimes returns something wrong. The screenshot above shows the output for the corresponding link.
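For reference, a minimal sketch of what the uncommented return could look like, reusing the bsoup helper above, with a guard so the runs where no pagination links come back don't raise an IndexError (the guard and the None fallback are illustrative, not part of the original code):

def max_page(link):
    soup = bsoup(link)
    page_num = soup.find_all(attrs={"data-testid": "pagination-page-link"})
    # log how many pagination links each request actually returned
    print(len(page_num), [x.get_text() for x in page_num])
    if not page_num:
        # no pagination links in this response; signal it instead of crashing
        return None
    return int(page_num[-1].get_text())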

Marc
  • Does this answer your question? [Web-scraping JavaScript page with Python](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python) – accdias Aug 08 '21 at 15:13
  • I don't think so. I'm not sure, but I don't think it has anything to do with JS. Why would the same function with the same parameter, run 10 times, give two different results? – Marc Aug 08 '21 at 16:30
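One way to check which of the two it is (a hedged sketch, not from the original post; the User-Agent header here is illustrative and not the original req_headers): request the page several times and count how often the pagination attribute shows up in the raw HTML. If it never appears, the links are rendered by JavaScript; if it appears only sometimes, the server is returning different HTML per request.

import requests

url = "https://www.trulia.com/for_rent/San_Francisco,CA/"
headers = {"User-Agent": "Mozilla/5.0"}  # illustrative headers

for i in range(10):
    resp = requests.get(url, headers=headers)
    # status code plus number of occurrences of the pagination attribute in the raw markup
    print(i, resp.status_code, resp.text.count('data-testid="pagination-page-link"'))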

1 Answer

Okay, if I understand what you want, you are trying to see how many pages of links there are for a given rental location. Assuming the given link is the only one required, this code:

import requests
import bs4

url = "https://www.trulia.com/for_rent/San_Francisco,CA/"

req = requests.get(url)
soup = bs4.BeautifulSoup(req.content, features='lxml')

def get_number_of_pages(soup):
    # the caption at the bottom of the results reads something like "1-40 of 3,515 Results"
    caption_tag = soup.find(
        'div',
        class_="Text__TextBase-sc-1cait9d-0 div Text__TextContainerBase-sc-1cait9d-1 RBSGf")
    pagination = caption_tag.text
    words = pagination.split(" ")
    values = []
    for word in words:
        # keep only the tokens that contain digits, e.g. "1-40" and "3,515"
        if not word.isalpha():
            values.append(word)
    links_per_page = values[0].split('-')[1]   # e.g. "40"
    total_links = values[1].replace(',', '')   # e.g. "3515"
    no_of_pages = round(int(total_links)/int(links_per_page) + 0.5)
    return no_of_pages

for i in range(10):
    print(get_number_of_pages(soup))

achieves what you're looking for, and it is repeatable because it doesn't depend on JavaScript; it reads the pagination caption at the bottom of the page instead.
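As a side note, round(x + 0.5) can over-count by one when the division is exact, because Python rounds halves to the nearest even integer (for example, 3 + 0.5 rounds to 4). A variant of the same idea using math.ceil, with a regex to pull the two numbers out of the caption, may be a bit more robust. This is a sketch only: the caption format "1-40 of 3,515" and the class string are assumptions carried over from the code above.

import math
import re

def get_number_of_pages_ceil(soup):
    # hypothetical variant: extract "1-40" and "3,515" from the caption with a regex
    caption_tag = soup.find(
        'div',
        class_="Text__TextBase-sc-1cait9d-0 div Text__TextContainerBase-sc-1cait9d-1 RBSGf")
    match = re.search(r'\d+-(\d+)\s+of\s+([\d,]+)', caption_tag.text)
    links_per_page = int(match.group(1))
    total_links = int(match.group(2).replace(',', ''))
    # ceiling division gives the exact number of pages
    return math.ceil(total_links / links_per_page)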

Austin