
I'm trying to scrape a paginated website, but my code gets the first page in every iteration. When I click the pager links in the browser, the content is different.

url = "http://www.x.y/z/a-b#/page-%s"

for i in range(1, 10):
    url2 = url % str(i)
    soup = urlToSoup(url2)
    print(url2)
    # url2 changes in every iteration,
    # but soup contains the same product list every time

This is the output:

http://www.x.y/z/a-b#/page-1
http://www.x.y/z/a-b#/page-2
http://www.x.y/z/a-b#/page-3
http://www.x.y/z/a-b#/page-4
http://www.x.y/z/a-b#/page-5
http://www.x.y/z/a-b#/page-6
http://www.x.y/z/a-b#/page-7
http://www.x.y/z/a-b#/page-8
http://www.x.y/z/a-b#/page-9

The pager item for page 2 (and similarly 3, 4, ...) looks as follows:

<a rel="nofollow" href="http://www.x.y/z/a-b#/page-2"> <span>2</span> </a>

Why is the resulting page different when I open the URL in the browser (via click or via the address bar) than when I fetch it from code?

xralf
    You are reusing the `url` variable, so the second time you go through the loop you don't have the `%s` there to be substituted by `str(i)`. That should cause an error though, aren't you getting one? – Paulo Almeida Aug 10 '17 at 19:53
  • @PauloAlmeida "url changes in every iteration". This would be too easy when not. (That's the reason I print it, to be sure). – xralf Aug 10 '17 at 19:57
  • I'm not sure I understand. `url` changes in a way that is not depicted in the code above? How exactly is it changing then? – Paulo Almeida Aug 10 '17 at 20:08
  • @PauloAlmeida You're absolutely right. I don't know how I could have overlooked it. I will use another variable in the loop. I will delete the question. Thank you. – xralf Aug 10 '17 at 20:36
  • @PauloAlmeida In the end, I edited the question instead, because the problem still remains even though url2 changes. (Originally I used urlToSoup(url % str(i)); then I changed it for the purpose of the question and made the mistake you noticed in the first comment.) So we could delete all the comments and start anew. – xralf Aug 10 '17 at 20:44
  • Ok, then the next question is what is in `urlToSoup`. I would run all the steps within it, with two different URLs, to see what is happening. Maybe there's some Javascript generating the pages, in which case you could try [selenium](http://docs.seleniumhq.org/) or using the browser developer tools to investigate what calls are being made to get the content. – Paulo Almeida Aug 10 '17 at 20:59
  • @PauloAlmeida urlToSoup is my helper; it wraps the boilerplate code I'm too lazy to write every time. Dan-dev solved it completely, with a great explanation. – xralf Aug 10 '17 at 22:16

1 Answer


You are adding text to the "fragment identifier" (i.e. the part after the #); see https://www.w3.org/DesignIssues/Fragment.html:

The fragment identifier is a string after URI, after the hash, which identifies something specific as a function of the document. For a user interface Web document such as HTML page, it typically identifies a part or view. For example in the object ...

RFC 3986 says:

the fragment identifier is separated from the rest of the URI prior to a dereference, and thus the identifying information within the fragment itself is dereferenced solely by the user agent, regardless of the URI scheme. Although this separate handling is often perceived to be a loss of information, particularly for accurate redirection of references as resources move over time, it also serves to prevent information providers from denying reference authors the right to refer to information within a resource selectively. Indirect referencing also provides additional flexibility and extensibility to systems that use URIs, as new media types are easier to define and deploy than new schemes of identification.

So you are adding your index to a part of the URL that is not sent to the server. It is for client-side use only ("dereferenced solely by the user agent"), so the server sees the same URL in every iteration.
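To check this from Python, here is a minimal sketch (Python 3, standard library only) that uses `urllib.parse.urldefrag` to split off the fragment. The URL template is the one from the question; `request_target` is just an illustrative helper name, not part of any library.

```python
from urllib.parse import urldefrag

url_template = "http://www.x.y/z/a-b#/page-%s"


def request_target(url):
    """Return the URL without its fragment, i.e. the part a server can see."""
    without_fragment, _fragment = urldefrag(url)
    return without_fragment


# Collect the distinct request targets for pages 1..9.
targets = {request_target(url_template % i) for i in range(1, 10)}
print(targets)  # {'http://www.x.y/z/a-b'}
```

All nine URLs collapse to a single request target, which is why every iteration returns the same HTML.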

The way the page is most likely rendered is that there is some JavaScript reading the fragment identifier and making another request to get the data or determining which part of the data to display.
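As a rough illustration of what such a client-side script might be doing, the sketch below reads the page number out of the fragment and builds a second URL from it. This is only a guess at the mechanism: the `page` query parameter and the function names are hypothetical, and the real request (if there is one) has to be found by watching the page's network traffic.

```python
import re
from urllib.parse import urlencode, urlsplit, urlunsplit


def page_from_fragment(url):
    """Extract the page number from a fragment like '/page-3' (None if absent)."""
    match = re.search(r"page-(\d+)", urlsplit(url).fragment)
    return int(match.group(1)) if match else None


def hypothetical_data_url(url):
    """Build a URL that carries the page number in the query string.

    The 'page' parameter name is an assumption made for illustration only.
    """
    parts = urlsplit(url)
    page = page_from_fragment(url)
    query = urlencode({"page": page}) if page is not None else parts.query
    # Drop the fragment: unlike the query string, it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(hypothetical_data_url("http://www.x.y/z/a-b#/page-3"))
# http://www.x.y/z/a-b?page=3
```

Unlike the fragment, the query string is part of the request, so a URL of this shape would let the server return a different page each time, provided the site actually supports such a parameter.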

I suggest examining all the requests the page makes, using Live HTTP Headers or some other tool, to see whether there is a second request you can use. Alternatively, use a JavaScript rendering technology like Selenium, dryscrape or PyQt5; see my answer to Scraping Google Finance (BeautifulSoup) for details.

Community
Dan-Dev