I would like to access all of the items in a given category on Amazon, but it seems that the category pages are generated via search. Bumping the page parameter in the search URL only takes you as far as the 100th page. Is there any way to get past that? Here's a sample URL for books
1 Answer
The content is loaded dynamically via an AJAX (XHR) call.
Long story short:
- open browser dev tools
- open network tab
- click a page link on Amazon
- see that the XHR request goes to
http://www.amazon.com/mn/search/ajax/ref=sr_pg_3...
- this is the URL you should request in your Scrapy spider (it returns JSON)
So, basically, you should just call this XHR request 100 times (or find out if you can get them all in one).
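As a rough sketch of that loop, the snippet below only builds the per-page URLs. It assumes the `ref=sr_pg_N` path segment and the `page` query parameter mentioned in this thread are what change between pages. `build_page_url`, `all_page_urls`, and `extra_params` are hypothetical names used for illustration. The rest of the query string is truncated in the URL above, so copy it from your own dev tools session and pass it in as `extra_params`.

```python
from urllib.parse import urlencode

# Endpoint prefix as seen in the dev tools. The remaining query string is
# elided in the answer and has to be copied from your own browser session.
AJAX_BASE = "http://www.amazon.com/mn/search/ajax"
MAX_PAGES = 100  # Amazon stops serving search results after page 100


def build_page_url(page, extra_params=None):
    """Return the XHR URL for one result page (hypothetical helper)."""
    if not 1 <= page <= MAX_PAGES:
        raise ValueError(f"page must be between 1 and {MAX_PAGES}")
    # Assumption: only `page` and the `ref=sr_pg_N` segment vary per page.
    params = dict(extra_params or {})
    params["page"] = page
    return f"{AJAX_BASE}/ref=sr_pg_{page}?{urlencode(params)}"


def all_page_urls(extra_params=None):
    """Yield one URL per page, from 1 up to the 100-page limit."""
    for page in range(1, MAX_PAGES + 1):
        yield build_page_url(page, extra_params)
```

In a Scrapy spider you would yield a `scrapy.Request` for each of these URLs and decode the JSON body in the callback, for example with `json.loads(response.text)`.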
Useful links:
- Can scrapy be used to scrape dynamic content from websites that are using AJAX?
- Pagination using scrapy
Notes:
- Amazon limits search results to 100 pages
- you can try the Amazon API instead of scraping the website directly. See "Amazon API library for Python?".
Hope that helps.
-
Thanks for the tip, that was helpful. I'm taking a look at those two links you shared. As for the XHR request, it looks pretty nasty, as the JSON results actually contain the page's HTML. I tried bumping the two parameters to page=101 and ref=sr_pg_100, but the results are then empty. Any idea what the rest of the parameters are for? – Andres Apr 24 '13 at 23:55
-
It's something specific to this AJAX data provider; you probably need just `page`, and maybe `sort`. I've added some notes to the answer, see if that helps. – alecxe Apr 25 '13 at 08:26
-
I haven't looked at it in a while. Do you have anything? – Andres Oct 28 '14 at 20:49