
I'm trying to do some extraction with Scrapy, but it doesn't return the expected HTML. I don't know what the problem is, whether it could be the site's security or something else, because other pages return the correct result.

I'm trying to extract the list of posts at this link http://www.reclameaqui.com.br/busca/?q=estorno&empresa=Netshoes&pagina=2, which is about customers' dissatisfaction with services and products, but the HTML returned by the code below doesn't contain the list of posts, just a simple, almost empty page.

Does anyone know what could be happening, and what could be blocking the correct extraction?

I have already tried some desktop and online crawler tools, and the result is the same.

The code is simple; it is the same as in the Scrapy tutorial:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["reclameaqui.com.br"]
    start_urls = [
       "http://www.reclameaqui.com.br/busca/q=estorno&empresa=Netshoes&pagina=2"
    ]

    def parse(self, response):
        # save the raw response body to a local HTML file for inspection
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
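In case it is relevant, this is roughly how I have been checking what Scrapy actually downloads, using the Scrapy shell (same search URL as above, just as an example):

# open the page in the Scrapy shell and look at what actually comes back
scrapy shell "http://www.reclameaqui.com.br/busca/?q=estorno&empresa=Netshoes&pagina=2"

# then, inside the shell:
view(response)        # opens the downloaded body in a browser
print(response.body)  # or inspect the raw HTML directly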

1 Answer


First of all, you have an error in your start_urls. Replace:

start_urls = [
    "http://www.reclameaqui.com.br/busca/q=estorno&empresa=Netshoes&pagina=2"
]

with:

start_urls = [
   "http://www.reclameaqui.com.br/busca/?q=estorno&empresa=Netshoes&pagina=2"
]

Also, if you inspect the source of the response, you'll see several more challenges you need to overcome:

  • there is a form that needs to be submitted to proceed
  • form input values are calculated using JavaScript
  • the HTML itself is broken: the form is immediately closed and then the inputs come after it:

    <body>
    <form method="POST" action="%2fbusca%2f%3fq%3destorno%26empresa%3dNetshoes%26pagina%3d2"/>
    <input type="hidden" name="TS01867d0b_id" value="3"/><input type="hidden" name="TS01867d0b_cr" value=""/>
    <input type="hidden" name="TS01867d0b_76" value="0"/><input type="hidden" name="TS01867d0b_86" value="0"/>
    <input type="hidden" name="TS01867d0b_md" value="1"/><input type="hidden" name="TS01867d0b_rf" value="0"/>
    <input type="hidden" name="TS01867d0b_ct" value="0"/><input type="hidden" name="TS01867d0b_pd" value="0"/>
    </form>
    </body>
    

The first problem is easily solved by using FormRequest.from_response(). The second is a more serious issue, and you might only get away with using a real browser (look up selenium); I've tried to use ScrapyJS, but was not able to solve it. The third problem, if you are not switching to a real browser, might be solved by letting BeautifulSoup and its lenient html5lib parser fix the HTML.
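As a rough standalone illustration of the third point, here is what the html5lib repair does to a snippet like the one above; html5lib ignores the bogus self-closing slash on <form>, so the hidden inputs end up nested inside the form, which is exactly what FormRequest.from_response() needs:

from bs4 import BeautifulSoup

# a shortened version of the broken markup shown above
broken = """
<body>
<form method="POST" action="%2fbusca%2f%3fq%3destorno%26empresa%3dNetshoes%26pagina%3d2"/>
<input type="hidden" name="TS01867d0b_id" value="3"/>
<input type="hidden" name="TS01867d0b_cr" value=""/>
</form>
</body>
"""

soup = BeautifulSoup(broken, "html5lib")
# the inputs are now children of the <form>, not siblings of it
print(soup.form.prettify())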

Here are the above-mentioned ideas put together in Python/Scrapy (not working: I get a "Connection to the other side was lost in a non-clean fashion" error, and I suspect not all of the input values/POST parameters are calculated correctly):

from bs4 import BeautifulSoup
import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    start_urls = [
       "http://www.reclameaqui.com.br/busca/?q=estorno&empresa=Netshoes&pagina=2"
    ]

    def start_requests(self):
        # route the start URLs through Splash (render.html endpoint) so the
        # JavaScript on the page gets a chance to run before parsing
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse_page, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.8}
                }
            })

    def parse_page(self, response):
        # let html5lib repair the broken markup so that the hidden inputs
        # end up inside the <form> element
        soup = BeautifulSoup(response.body, "html5lib")
        response = response.replace(body=soup.prettify())

        # submit the hidden form with the values present in the repaired HTML
        return scrapy.FormRequest.from_response(response,
                                                callback=self.parse_form_request,
                                                url="http://www.reclameaqui.com.br/busca/?q=estorno&empresa=Netshoes&pagina=2",
                                                headers={
                                                    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
                                                })

    def parse_form_request(self, response):
        # if the challenge was passed, the post listing should be in this response
        print(response.body)
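If the Splash route keeps failing, a plain selenium sketch along these lines should at least get you the fully rendered page (Firefox is assumed here, and the fixed sleep is just a placeholder for a proper wait):

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.reclameaqui.com.br/busca/?q=estorno&empresa=Netshoes&pagina=2")
time.sleep(5)              # give the JavaScript challenge time to run; tune as needed

html = driver.page_source  # rendered HTML, hopefully including the post list
driver.quit()

soup = BeautifulSoup(html, "html5lib")
print(soup.prettify())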

For more details on setting up selenium and ScrapyJS, see their respective documentation.
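For reference, the ScrapyJS/Splash wiring on the settings.py side looks roughly like this (the SPLASH_URL value is only an example of where a Splash instance might be listening; the package has since been renamed to scrapy-splash):

# settings.py
SPLASH_URL = 'http://localhost:8050'  # address of the running Splash instance (example)

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

# make duplicate filtering aware of the Splash arguments
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'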

Also, make sure you follow the rules described on the Terms of Use page.
