I've read some relevant posts here but couldn't figure out an answer.

I'm trying to crawl a web page with reviews. When the site is first visited, only 10 reviews are shown, and the user has to press "Show more" to load 10 more reviews (which also appends #add10 to the site's address) every time they scroll to the end of the review list. In fact, a user can get the full review list by appending #add1000 (where 1000 is the number of additional reviews) to the site's address. The problem is that my spider gets only the first 10 reviews with site_url#add1000, just as it does with plain site_url, so this approach doesn't work.
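A likely reason the #add1000 trick can't work from a spider: the fragment part of a URL (everything after #) is handled entirely by the browser and is never sent to the server, so both addresses produce identical HTTP requests. The standard library shows the split:

```python
from urllib.parse import urlsplit

# The fragment is kept client-side; the server only ever sees the path.
parts = urlsplit("http://example.com/reviews#add1000")
print(parts.path)      # → /reviews
print(parts.fragment)  # → add1000
```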

I also can't find a way to make an appropriate Request that imitates the original one from the site. The original AJAX URL has the form 'domain/ajaxlst?par1=x&par2=y', and I have tried all of the following:

Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all) 
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers}, cookies={all_cookies})

But every time I get a 404 error. Can anyone explain what I'm doing wrong?

Alex K.

2 Answers

What you need here is a headless browser, since the requests module cannot handle AJAX well.

One such headless browser is Selenium.

e.g.:

driver.find_element_by_id("show-more").click()  # just an example; the element id is site-specific
Steve
  • Well, I've used Selenium + PhantomJS before, but it's relatively slow. Are you sure there isn't a better way? – Alex K. Jan 13 '16 at 11:17
  • @AlexK. There are other ways, see http://stackoverflow.com/questions/16390257/scraping-ajax-pages-using-python, but I don't know about the navigation part. – Steve Jan 13 '16 at 11:24
  • Thank you. I've found a mistake in my code: I was missing the 'x-requested-with': 'XMLHttpRequest' line in my headers, and nobody could notice it since I didn't provide that part of the code... Since your answer suggests another appropriate way of solving the problem, I'm marking it as the solution. – Alex K. Jan 13 '16 at 11:38
  • @AlexK. That's what happens when you provide only a sample of your code: sometimes you forget the important snippet. Also, try to provide a URL for others to check :-) – Steve Jan 13 '16 at 11:42
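The fix described in the comments above can be sketched as follows; the endpoint and query parameters are placeholders taken from the question:

```python
# The AJAX endpoint rejected requests that lacked the X-Requested-With
# header, which browsers add to XMLHttpRequest calls automatically.
ajax_url = "http://domain/ajaxlst"        # placeholder endpoint
params = {"par1": "x", "par2": "y"}       # placeholder query parameters

headers = {
    "X-Requested-With": "XMLHttpRequest",  # the header that was missing
}

# In a Scrapy spider the request would then look like (not executed here):
# yield Request(url="http://domain/ajaxlst?par1=x&par2=y",
#               headers=headers, callback=self.parse_all)
print(headers["X-Requested-With"])  # → XMLHttpRequest
```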

Normally, when you scroll down the page, AJAX sends a request to the server, and the server responds with a JSON/XML file that your browser uses to refresh the page.

You need to figure out the URL of this JSON/XML file. Normally, you can open Firefox's Tools → Web Developer → Web Console, monitor the network activity, and easily catch this JSON/XML request.

Once you find this file, you can parse the reviews from it directly (I recommend the Python modules requests and bs4 for this) and save a huge amount of time. Remember to use a few different clients and IPs. Be nice to the server and it won't block you.
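A minimal sketch of that parsing step, assuming the endpoint returns JSON; the 'reviews'/'text' field names below are hypothetical, so inspect the real response in the network tools to find the actual structure:

```python
import json

def parse_reviews(payload):
    """Extract review texts from the AJAX response body.

    The 'reviews' and 'text' keys are assumptions for illustration;
    the real JSON shape must be checked in the browser's network tab.
    """
    data = json.loads(payload)
    return [item["text"] for item in data.get("reviews", [])]

# Fetching the file itself would be done with requests (not executed here):
# resp = requests.get("http://domain/ajaxlst", params={"par1": "x", "par2": "y"})
# reviews = parse_reviews(resp.text)

sample = '{"reviews": [{"text": "Great!"}, {"text": "Okay."}]}'
print(parse_reviews(sample))  # → ['Great!', 'Okay.']
```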

Hao Lyu
  • Thank you! Could you advise me how to organize my program? – Alex K. Jan 16 '16 at 12:35
  • It's pretty simple now: a script parses the sites completely (all links and pages) once a day to get fresh reviews. But I realize that it's not the best way: 1) I get fresh reviews with a day's delay; 2) parsing all the information from the sites every time probably isn't necessary. Now I'm thinking about sending HEAD requests once an hour to the sites' first pages only, and parsing only those first pages, and only when they have changed. Is this a good approach? Wouldn't my HEAD requests disturb the sites too much? Are there better ways? Thank you! – Alex K. Jan 16 '16 at 12:46
  • I think it will work, since you won't be sending too many requests. – Hao Lyu Jan 19 '16 at 21:22
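The hourly-polling idea from these comments can be sketched as follows; relying on the Last-Modified header is an assumption, since not every server sends it:

```python
# Keep the last seen Last-Modified value per URL and re-parse a page
# only when the value changes. In practice `last_modified` would come
# from requests.head(url).headers.get("Last-Modified"); it is passed in
# here so the logic runs without network access.
last_seen = {}

def needs_refresh(url, last_modified):
    """Return True when the page changed since the previous check."""
    if last_seen.get(url) == last_modified:
        return False
    last_seen[url] = last_modified
    return True

print(needs_refresh("http://domain/reviews", "Sat, 16 Jan 2016 12:00:00 GMT"))  # → True
print(needs_refresh("http://domain/reviews", "Sat, 16 Jan 2016 12:00:00 GMT"))  # → False
```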