Why am I not able to scrape all items in a page?

Question

I'm trying to scrape the href of each house on this website: https://www.zapimoveis.com.br/venda/imoveis/ma+sao-jose-de-ribamar/. The problem is that the page has 150 houses, but my code only scrapes 15 houses per page. I don't know if the problem is my XPaths or my code.

This is the code:

def parse(self, response):
    hrefs = response.css('a.result-card ::attr(href)').getall()

    for url in hrefs:
        yield response.follow(url, callback=self.parse_imovel_info,
                              dont_filter=True)

def parse_imovel_info(self, response):
    zap_item = ZapItem()

    imovel_info = response.css('ul.amenities__list ::text').getall()
    tipo_imovel = response.css('a.breadcrumb__link--router ::text').get()
    endereco_imovel = response.css('span.link ::text').get()
    preco_imovel = response.xpath('//li[@class="price__item--main text-regular text-regular__bolder"]/strong/text()').get()
    condominio = response.xpath('//li[@class="price__item condominium color-dark text-regular"]/span/text()').get()
    iptu = response.xpath('//li[@class="price__item iptu color-dark text-regular"]/span/text()').get()
    area = response.xpath('//ul[@class="feature__container info__base-amenities"]/li').css('span[itemprop="floorSize"]::text').get()
    num_quarto = response.xpath('//ul[@class="feature__container info__base-amenities"]/li').css('span[itemprop="numberOfRooms"]::text').get()
    num_banheiro = response.xpath('//ul[@class="feature__container info__base-amenities"]/li').css('span[itemprop="numberOfBathroomsTotal"]::text').get()
    num_vaga = response.xpath('//ul[@class="feature__container info__base-amenities"]/li[@class="feature__item text-regular js-parking-spaces"]/span/text()').get()
    andar = response.xpath('//ul[@class="feature__container info__base-amenities"]/li').css('span[itemprop="floorLevel"]::text').get()
    url = response.url
    id = re.search(r'id-(\d+)/', url).group(1)

    filtering = lambda info: [check if info == check.replace('\n', '').lower().strip() else None for check in imovel_info]

    lista = {
        'academia': list(filter(lambda x: "academia" in x.lower(), imovel_info)),
        'piscina': list(filter(lambda x: x != None, filtering('piscina'))),
        'spa': list(filter(lambda x: x != None, filtering('spa'))),
        'sauna': list(filter(lambda x: "sauna" in x.lower(), imovel_info)),
        'varanda_gourmet': list(filter(lambda x: "varanda gourmet" in x.lower(), imovel_info)),
        'espaco_gourmet': list(filter(lambda x: "espaço gourmet" in x.lower(), imovel_info)),
        'quadra_de_esporte': list(filter(lambda x: 'quadra poliesportiva' in x.lower(), imovel_info)),
        'playground': list(filter(lambda x: "playground" in x.lower(), imovel_info)),
        'portaria_24_horas': list(filter(lambda x: "portaria 24h" in x.lower(), imovel_info)),
        'area_servico': list(filter(lambda x: "área de serviço" in x.lower(), imovel_info)),
        'elevador': list(filter(lambda x: "elevador" in x.lower(), imovel_info))
    }

    for info, conteudo in lista.items():
        if len(conteudo) == 0:
            zap_item[info] = None
        else:
            zap_item[info] = conteudo[0]

    zap_item['valor'] = preco_imovel,
    zap_item['tipo'] = tipo_imovel,
    zap_item['endereco'] = endereco_imovel.replace('\n', '').strip(),
    zap_item['condominio'] = condominio,
    zap_item['iptu'] = iptu,
    zap_item['area'] = area,
    zap_item['quarto'] = num_quarto,
    zap_item['vaga'] = num_vaga,
    zap_item['banheiro'] = num_banheiro,
    zap_item['andar'] = andar,
    zap_item['url'] = response.url,
    zap_item['id'] = int(id)
    yield zap_item

Can someone help me?

Indentation is important in Python, but the code you have posted has none. There's no way to determine whether this code works as intended. — Tangentially Perpendicular, Jul 29 '23 at 01:54
"The problem is that the page has 150 houses, but my code only scrape 15 houses per page." When I look at the page in my browser, I only see 15 houses. The reason is that I have NoScript installed, and I have not been to this site before, so Javascript is disabled. This shows that the other listings require running Javascript to display. Please see the linked duplicate for this extremely common problem with scraping. — Karl Knechtel, Jul 29 '23 at 02:07

score -2 · Answer 1 · answered Jul 29 '23 at 02:03

Judging by the code, you appear to be using a web scraping framework (probably Scrapy). The site lists 150 properties in total, but your spider only collects 15 houses per page.

The most likely cause is that the listings are paginated: the houses are spread over several pages, and your spider only scrapes the first one, which has 15 houses. To scrape all 150, the spider has to follow the pagination.

The general solution to this issue is as follows:

1. Determine the pagination URL pattern. Look at the pagination controls on the page, move to the next page, and note which part of the URL changes from page to page.

2. Make the spider follow the pagination. After scraping the listings on a page, have the spider follow the link to the next page and run the same parse method on it.
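
The first step can be sketched as a small helper that builds one URL per results page. This is only an illustration: the helper name `build_page_urls`, the query parameter name `pagina` and the page count are assumptions, not something confirmed for this site, so replace them with whatever you actually observe in the browser's address bar.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_urls(base_url, n_pages, page_param="pagina"):
    """Return one URL per results page, numbered 1..n_pages.

    `page_param` is a hypothetical query-string parameter name.
    """
    parts = urlsplit(base_url)
    # Keep any query parameters that are already on the URL.
    query = dict(parse_qsl(parts.query))
    urls = []
    for page in range(1, n_pages + 1):
        query[page_param] = str(page)
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls


base = "https://www.zapimoveis.com.br/venda/imoveis/ma+sao-jose-de-ribamar/"
for url in build_page_urls(base, 3):
    print(url)
```

If the site does use a numbered parameter like this, the resulting list could be assigned to the spider's `start_urls`.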

Here is an illustration of how to handle pagination in your code:

import scrapy

class MySpider(scrapy.Spider):
    name = 'zap_spider'
    start_urls = ['https://www.zapimoveis.com.br/venda/imoveis/ma+sao-jose-de-ribamar/']

    def parse(self, response):
        # Scrape hrefs from the current page
        hrefs = response.css('a.result-card ::attr(href)').getall()
        for url in hrefs:
            yield response.follow(url, callback=self.parse_imovel_info, dont_filter=True)

        # Check if there's a next page and follow it.
        # NOTE: this selector is a guess; inspect the page to find the real one.
        next_page_url = response.css('a.pagination__item--next ::attr(href)').get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse, dont_filter=True)

    def parse_imovel_info(self, response):
        # Your parsing logic remains the same
        ...
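
A comment under the question suggests a different cause: the remaining listings may be added by JavaScript, which Scrapy does not execute on its own. One way to check is to count the listing links present in the raw HTML that the spider actually receives. The sketch below does that with the standard library only. The `result-card` class comes from the selector in the question; the names `CardLinkCounter` and `count_card_links` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class CardLinkCounter(HTMLParser):
    """Collect hrefs of <a> tags whose class list contains `card_class`."""

    def __init__(self, card_class="result-card"):
        super().__init__()
        self.card_class = card_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.card_class in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def count_card_links(html, card_class="result-card"):
    """Return how many matching listing links appear in the given HTML."""
    parser = CardLinkCounter(card_class)
    parser.feed(html)
    return len(parser.hrefs)


# Made-up sample standing in for the raw page source (e.g. response.text).
sample = """
<div>
  <a class="result-card" href="/imovel/id-1/">Casa 1</a>
  <a class="result-card featured" href="/imovel/id-2/">Casa 2</a>
  <a class="other" href="/about/">About</a>
</div>
"""
print(count_card_links(sample))
```

If the count for the real page source matches the 15 items the spider yields, the selector is probably fine and the missing listings are simply not in the static HTML. In that case following pagination links alone may not be enough, and the data would have to come from a JavaScript-rendered page or from the requests the page makes in the background.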
