My spider isn't crawling horizontally, and I can't figure out why.

The parse_item function works well on the first page. I've checked the XPath for next_page in the Scrapy shell, and it is correct.
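
This is roughly how I checked it in the Scrapy shell (a sketch from memory; the exact output may differ):

$ scrapy shell 'https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/'
>>> response.xpath('//li[@class="pagination__item"][last()]//a/@href').get()
'#pagina=2'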

Could you please check my code?

The website I'm trying to scrape is this.

import scrapy
import datetime
import socket

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose
from properties.items import PropertiesItem


class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['www.vivareal.com.br']
    start_urls = ['https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/']

    next_page = '//li[@class="pagination__item"][last()]'

    rules = (
        Rule(LinkExtractor(restrict_xpaths=next_page)),
        Rule(LinkExtractor(allow=r'/imovel/',
                           deny=r'/imoveis-lancamento/'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('url', 'a/@href', )
        l.add_xpath('tipo', '//h1/text()',
                    MapCompose(lambda x: x.strip().split()[0]))
        l.add_xpath('valor', '//h3[@class="price__price-info js-price-sale"]/text()',
                    MapCompose(lambda x: x.strip().replace('R$ ', '').replace('.', ''), float))
        l.add_xpath('condominio', '//span[@class="price__list-value condominium js-condominium"]/text()',
                    MapCompose(lambda x: x.strip().replace('R$ ', '').replace('.', ''), float))
        l.add_xpath('endereco', '//p[@class="title__address js-address"]/text()',
                    MapCompose(lambda x: x.split(' - ')[0]))
        l.add_xpath('bairro', '//p[@class="title__address js-address"]/text()',
                    MapCompose(lambda x: x.split(' - ')[1].split(',')[0]))
        l.add_xpath('quartos', '//ul[@class="features"]/li[@title="Quartos"]/span/text()',
                    MapCompose(lambda x: x.strip(), int))
        l.add_xpath('banheiros', '//ul[@class="features"]/li[@title="Banheiros"]/span/text()',
                    MapCompose(lambda x: x.strip(), int))
        l.add_xpath('vagas', '//ul[@class="features"]/li[@title="Vagas"]/span/text()',
                    MapCompose(lambda x: x.strip(), int))
        l.add_xpath('area', '//ul[@class="features"]/li[@title="Área"]/span/text()',
                    MapCompose(lambda x: x.strip(), float))
        l.add_value('url', response.url)
        
        # Housekeeping fields
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())
        
        return l.load_item()

UPDATE

Searching the log, I found this about the horizontal crawl:

2021-02-22 17:09:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 17:09:24 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)

It seems like the next-page request is being filtered as a duplicate, but I don't know how to fix it.

In addition, I realized that, despite the href pointing to #pagina=2, the actual URL is ?pagina=2.
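
If that's the cause, would something like this sketch work? (Untested; it assumes LinkExtractor's process_value hook, which receives each raw href and can rewrite it before the request is built.)

    next_page = '//li[@class="pagination__item"][last()]'

    rules = (
        # Rewrite '#pagina=N' into '?pagina=N' so the pagination request no
        # longer fingerprints as a duplicate of the page it was found on
        # (the dupefilter ignores URL fragments).
        Rule(LinkExtractor(restrict_xpaths=next_page,
                           process_value=lambda v: v.replace('#pagina=', '?pagina='))),
        Rule(LinkExtractor(allow=r'/imovel/',
                           deny=r'/imoveis-lancamento/'),
             callback='parse_item'),
    )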

Any hints?

Marcelo

1 Answer

Actually, your spider is not even crawling the first page.

The problem resides in the allowed_domains parameter. Change it to

allowed_domains = ['www.vivareal.com.br']

and you will start crawling. After that change you will get a lot of errors (exceptions thrown due to logical errors in the code, as I saw here), but your spider will be running as intended.

EDIT (2):

Check the logs:

2021-02-22 13:36:19 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.vivareal.com.br': <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2>

Basically, allowed_domains was not properly set, as explained here and in this old question.
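
For illustration, this is the kind of value that trips the offsite filter (an assumption on my part, since the question has since been edited and I can't see the original value):

# Wrong: a full URL (or anything with a scheme or path) never matches,
# so OffsiteMiddleware filters every request as offsite.
allowed_domains = ['https://www.vivareal.com.br/']

# Right: the bare hostname that serves the pages.
allowed_domains = ['www.vivareal.com.br']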

EDIT: to make it clear, the log I get after running the spider as defined in the question is:


2021-02-22 13:29:18 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: properties)
2021-02-22 13:29:18 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.1 (default, Feb  9 2020, 21:34:32) - [GCC 7.4.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020), cryptography 3.3.1, Platform Linux-4.15.0-135-generic-x86_64-with-glibc2.27
2021-02-22 13:29:18 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-02-22 13:29:18 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'properties',
 'NEWSPIDER_MODULE': 'properties.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['properties.spiders']}
2021-02-22 13:29:18 [scrapy.extensions.telnet] INFO: Telnet Password: 3790c3525890efea
2021-02-22 13:29:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-02-22 13:29:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-02-22 13:29:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-02-22 13:29:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-02-22 13:29:18 [scrapy.core.engine] INFO: Spider opened
2021-02-22 13:29:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-02-22 13:29:18 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-02-22 13:29:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/robots.txt> (referer: None)
2021-02-22 13:29:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/> (referer: None)
2021-02-22 13:29:20 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.vivareal.com.br': <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2>
2021-02-22 13:29:20 [scrapy.core.engine] INFO: Closing spider (finished)
2021-02-22 13:29:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 606,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 156997,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 1.87473,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 2, 22, 16, 29, 20, 372722),
 'log_count/DEBUG': 3,
 'log_count/INFO': 10,
 'memusage/max': 54456320,
 'memusage/startup': 54456320,
 'offsite/domains': 1,
 'offsite/filtered': 34,
 'request_depth_max': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 2, 22, 16, 29, 18, 497992)}
2021-02-22 13:29:20 [scrapy.core.engine] INFO: Spider closed (finished)

and when I run it with the proposed change, the log is this (adapted to not show my paths):

2021-02-22 13:31:47 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: properties)
2021-02-22 13:31:47 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.1 (default, Feb  9 2020, 21:34:32) - [GCC 7.4.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020), cryptography 3.3.1, Platform Linux-4.15.0-135-generic-x86_64-with-glibc2.27
2021-02-22 13:31:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-02-22 13:31:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'properties',
 'NEWSPIDER_MODULE': 'properties.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['properties.spiders']}
2021-02-22 13:31:47 [scrapy.extensions.telnet] INFO: Telnet Password: 65a5f31c8dda80fa
2021-02-22 13:31:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-02-22 13:31:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-02-22 13:31:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-02-22 13:31:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-02-22 13:31:47 [scrapy.core.engine] INFO: Spider opened
2021-02-22 13:31:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-02-22 13:31:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-02-22 13:31:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/robots.txt> (referer: None)
2021-02-22 13:31:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/> (referer: None)
2021-02-22 13:31:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-1-quartos-funcionarios-bairros-belo-horizonte-com-garagem-41m2-venda-RS330000-id-2510414426/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-3-quartos-nova-granada-bairros-belo-horizonte-com-garagem-74m2-venda-RS499000-id-2509923918/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-4-quartos-serra-bairros-belo-horizonte-com-garagem-246m2-venda-RS1950000-id-2510579983/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/casa-3-quartos-sao-geraldo-bairros-belo-horizonte-com-garagem-120m2-venda-RS460000-id-2484383176/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-4-quartos-savassi-bairros-belo-horizonte-com-garagem-206m2-venda-RS1790000-id-2503711314/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-2-quartos-paqueta-bairros-belo-horizonte-com-garagem-60m2-venda-RS260000-id-2479637684/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-3-quartos-savassi-bairros-belo-horizonte-com-garagem-107m2-venda-RS1250000-id-2506122689/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.vivareal.com.br/imovel/apartamento-1-quartos-funcionarios-bairros-belo-horizonte-com-garagem-41m2-venda-RS330000-id-2510414426/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spiders/crawl.py", line 114, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "/home/leomaffei/properties/properties/spiders/spider.py", line 28, in parse_item
    l.add_xpath('url', 'a/@href', )
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 350, in add_xpath
    self.add_value(field_name, values, *processors, **kw)
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 190, in add_value
    self._add_value(field_name, value)
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 208, in _add_value
    processed_value = self._process_input_value(field_name, value)
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 312, in _process_input_value
    proc = self.get_input_processor(field_name)
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 290, in get_input_processor
    proc = self._get_item_field_attr(
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 308, in _get_item_field_attr
    field_meta = ItemAdapter(self.item).get_field_meta(field_name)
  File "/usr/lib/python3.8/site-packages/itemadapter/adapter.py", line 235, in get_field_meta
    return self.adapter.get_field_meta(field_name)
  File "/usr/lib/python3.8/site-packages/itemadapter/adapter.py", line 161, in get_field_meta
    return MappingProxyType(self.item.fields[field_name])
KeyError: 'url'
2021-02-22 13:31:50 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.vivareal.com.br/imovel/apartamento-3-quartos-nova-granada-bairros-belo-horizonte-com-garagem-74m2-venda-RS499000-id-2509923918/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filte
...
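
Also, the KeyError: 'url' in the traceback above suggests that (at least in my test project) PropertiesItem does not declare the fields the loader references. A minimal properties/items.py sketch that would avoid those errors (my assumption, since the question doesn't show that file):

import scrapy

class PropertiesItem(scrapy.Item):
    # Every field passed to add_xpath/add_value must be declared,
    # otherwise the ItemLoader raises KeyError for that field name.
    url = scrapy.Field()
    tipo = scrapy.Field()
    valor = scrapy.Field()
    condominio = scrapy.Field()
    endereco = scrapy.Field()
    bairro = scrapy.Field()
    quartos = scrapy.Field()
    banheiros = scrapy.Field()
    vagas = scrapy.Field()
    area = scrapy.Field()
    # Housekeeping fields
    project = scrapy.Field()
    spider = scrapy.Field()
    server = scrapy.Field()
    date = scrapy.Field()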

Leonardo Maffei
  • I made the changes, but it's still crawling only page one. I checked the log, and it appears that the spider is capturing the next page but the server is returning the original page: `2021-02-22 11:09:35 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/) ` – Marcelo Feb 22 '21 at 14:14
  • @Marcelo isn't that because of the exceptions raised? I suppose the exception thrown at **l.add_xpath('url', 'a/@href', )** causes all the code below it to fail, so the **return l.load_item()** is never executed. Maybe this could be the cause after all? – Leonardo Maffei Feb 22 '21 at 15:23
  • What exceptions? Checking the log of my crawler, I found no such exceptions. It captures the `url` field correctly. – Marcelo Feb 22 '21 at 15:53
  • As I said, the code parses the whole first page, captures the links to all the property pages, and saves the items correctly. It cannot, however, go horizontally to page 2. – Marcelo Feb 22 '21 at 15:54
  • I edited my answer to make myself clearer – Leonardo Maffei Feb 22 '21 at 16:35
  • @Marcelo the log says ```2021-02-22 13:36:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)``` – Leonardo Maffei Feb 22 '21 at 16:37
  • It's very strange, because that is not at all what happens here. I commented out the `l.add_xpath('url', 'a/@href', )` line, but it still only scrapes the first page. – Marcelo Feb 22 '21 at 16:42
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/229049/discussion-between-leonardo-maffei-and-marcelo). – Leonardo Maffei Feb 22 '21 at 16:43