Python Scrape with requests and beautifulsoup

Question

I am doing a scraping exercise using Python requests and BeautifulSoup. Basically, I am crawling an Amazon web page. I am able to crawl the first page without any issues.

import requests
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
# do something with r

But when I try to crawl the 2nd page by adding "#2" to the URL:

r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2")

I see that r still has the same value as for the first page, i.e. the same as the result of:

r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")

I don't know whether the #2 is causing trouble when requesting the second page. I also googled the issue but could not find a fix. What is the right way to request a URL with a # value? How can I address this issue? Please advise.

score 1 · Answer 1 · edited May 23 '17 at 12:31

"#2" is a fragment identifier; it is not sent to the server. The HTML content that you get when requesting "http://someurl.com/page#123" is the same as the content for "http://someurl.com/page".
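To make the point about the fragment concrete, here is a minimal sketch using only the standard library. It splits the URL from the question into the part a server would see and the fragment that stays on the client side. The variable names are just for illustration.

```python
from urllib.parse import urldefrag

# URL from the question, including the "#2" fragment.
url = "http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2"

# urldefrag separates the fragment from the rest of the URL.
base, fragment = urldefrag(url)

print(base)      # the part that is actually requested from the server
print(fragment)  # the part that only client-side JavaScript can act on
```

Since the part before the "#" is identical for both URLs in the question, both requests fetch the same resource.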

In a browser you see the second page because the page's JavaScript reads the fragment identifier, makes an AJAX request and injects the new content into the page. You should find the AJAX request's URL and use that instead.


It looks like our URL is:

http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&aj

From this it is easy to see that all we need to do is change the value of the "pg" parameter to get the other pages.
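As a rough sketch of changing the "pg" value, the helper below builds one URL per page. The function name `page_url` is hypothetical, and the code assumes that the `ref=zg_bs_books_pg_N` path segment changes together with `pg`, as in the URLs quoted in this thread. The trailing parameter that is cut off in the URL above is left out, and whether the site still answers on this pattern is not verified here.

```python
from urllib.parse import urlencode

# Assumed URL pattern, taken from the page URLs quoted in this thread.
BASE = "http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_{page}"


def page_url(page):
    """Return the (assumed) URL for the given bestseller page number."""
    query = urlencode({"ie": "UTF8", "pg": page})
    return BASE.format(page=page) + "?" + query


# Build the URLs for the first three pages.
urls = [page_url(n) for n in range(1, 4)]
for u in urls:
    print(u)
```

Each of these URLs can then be passed to `requests.get(...)` and the response text handed to BeautifulSoup, exactly as for the first page.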

score 0 · Accepted Answer · answered May 25 '15 at 10:26

You need to request the URL in the href attribute of the anchor tags that make up the pagination, at the bottom of the page. If I inspect the page in the developer console in Google Chrome, I find that the first page's URL looks like this:

http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_1?ie=UTF8&pg=1

and the second page's URL looks like this:

http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2

The anchor tag for the second page looks like this:

<a page="2" ajaxUrl="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" href="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>

So you need to change the request URL to the one in the href attribute.
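A minimal, dependency-free sketch of pulling that href out of the markup is shown below. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages. The class name `PaginationLinks` is hypothetical, and the snippet is just the anchor tag quoted above, not a live page.

```python
from html.parser import HTMLParser

# Anchor tag quoted in the answer above.
SNIPPET = (
    '<a page="2" '
    'ajaxUrl="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" '
    'href="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>'
)


class PaginationLinks(HTMLParser):
    """Collect the href of every <a> tag that carries a page attribute."""

    def __init__(self):
        super().__init__()
        self.links = {}  # page number -> href

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "page" in attributes and "href" in attributes:
            self.links[int(attributes["page"])] = attributes["href"]


parser = PaginationLinks()
parser.feed(SNIPPET)
print(parser.links)
```

With BeautifulSoup, `soup.find_all("a", href=True)` followed by reading `tag["href"]` would serve the same purpose on the downloaded page.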
