First of all, you have an error in your start_urls - the "?" that begins the query string is missing. Replace:

start_urls = [
    "http://www.reclameaqui.com.br/busca/q=estorno&empresa=Netshoes&pagina=2"
]

with:

start_urls = [
    "http://www.reclameaqui.com.br/busca/?q=estorno&empresa=Netshoes&pagina=2"
]
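To see why the "?" matters: without it, the whole parameter string is swallowed by the URL path and no query string is sent to the server. A quick check with the standard library's urllib.parse shows the difference:

```python
from urllib.parse import urlparse, parse_qs

broken = "http://www.reclameaqui.com.br/busca/q=estorno&empresa=Netshoes&pagina=2"
fixed = "http://www.reclameaqui.com.br/busca/?q=estorno&empresa=Netshoes&pagina=2"

# Without the "?", the parameters become part of the path
print(urlparse(broken).query)  # '' - empty query string
print(urlparse(broken).path)   # '/busca/q=estorno&empresa=Netshoes&pagina=2'

# With the "?", they parse as a real query string
print(parse_qs(urlparse(fixed).query))
```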
Also, if you inspect the source of the response, you'll see several more challenges you need to overcome:
- there is a form that needs to be submitted to proceed
- the form input values are calculated using JavaScript
- the HTML itself is broken - the form is immediately closed and only then do the inputs come:
<body>
<form method="POST" action="%2fbusca%2f%3fq%3destorno%26empresa%3dNetshoes%26pagina%3d2"/>
<input type="hidden" name="TS01867d0b_id" value="3"/><input type="hidden" name="TS01867d0b_cr" value=""/>
<input type="hidden" name="TS01867d0b_76" value="0"/><input type="hidden" name="TS01867d0b_86" value="0"/>
<input type="hidden" name="TS01867d0b_md" value="1"/><input type="hidden" name="TS01867d0b_rf" value="0"/>
<input type="hidden" name="TS01867d0b_ct" value="0"/><input type="hidden" name="TS01867d0b_pd" value="0"/>
</form>
</body>
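Note also that the form's action attribute is percent-encoded; decoding it shows the form simply posts back to the same search URL:

```python
from urllib.parse import unquote

# the action attribute exactly as it appears in the broken form
action = "%2fbusca%2f%3fq%3destorno%26empresa%3dNetshoes%26pagina%3d2"
print(unquote(action))  # /busca/?q=estorno&empresa=Netshoes&pagina=2
```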
The first problem is easily solved by using FormRequest.from_response(). The second is a more serious issue, and you might only get away with using a real browser (look up selenium) - I've tried to use ScrapyJS, but was not able to solve it. The third problem, if not switching to a real browser, might be solved by letting BeautifulSoup and its lenient html5lib parser fix the HTML.
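To illustrate the third point, here is a minimal sketch (assuming beautifulsoup4 and html5lib are installed) of how html5lib repairs a simplified version of that broken markup. html5lib parses the way a browser does: the stray "/" on the non-void <form> element is ignored, the form stays open, and the inputs become its children - which is exactly what FormRequest.from_response() needs in order to pick them up:

```python
from bs4 import BeautifulSoup

# simplified version of the broken markup: the <form/> is
# self-closed, so the inputs appear to follow it rather than sit inside it
broken_html = (
    '<body>'
    '<form method="POST" action="/busca/"/>'
    '<input type="hidden" name="TS01867d0b_id" value="3"/>'
    '<input type="hidden" name="TS01867d0b_cr" value=""/>'
    '</body>'
)

# after html5lib's browser-like repair, the inputs are inside the form
soup = BeautifulSoup(broken_html, "html5lib")
form = soup.find("form")
input_names = [i["name"] for i in form.find_all("input")]
print(input_names)
```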
Here are the above-mentioned ideas combined in Python/Scrapy (not working - I'm getting a "Connection to the other side was lost in a non-clean fashion" error; I suspect not all of the input values/POST parameters were calculated):
from bs4 import BeautifulSoup
import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    start_urls = [
        "http://www.reclameaqui.com.br/busca/?q=estorno&empresa=Netshoes&pagina=2"
    ]

    def start_requests(self):
        # render each page through Splash (ScrapyJS) so the JavaScript runs
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse_page, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.8}
                }
            })

    def parse_page(self, response):
        # let html5lib repair the broken markup so the hidden
        # inputs end up inside the form
        soup = BeautifulSoup(response.body, "html5lib")
        response = response.replace(body=soup.prettify())

        # submit the repaired form
        return scrapy.FormRequest.from_response(
            response,
            callback=self.parse_form_request,
            url="http://www.reclameaqui.com.br/busca/?q=estorno&empresa=Netshoes&pagina=2",
            headers={
                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
            }
        )

    def parse_form_request(self, response):
        print(response.body)
For more on selenium and ScrapyJS setup, see the respective project documentation.
Also, make sure you follow the rules described on the Terms of Use page.