Unable to access the whole content of the downloaded html file

Question

My original task is to download multiple scientific publications as HTML files. Currently my script downloads the file in Chrome, but in Firefox it takes me to the URL instead. But that is not my main question.

If you look at the downloaded HTML source, you will find that not all of the content was downloaded; only part of it shows up in the file. That is my problem: why am I not able to get the whole HTML document in the downloaded file? The file I want to download is this:

var links = [
      'http://www.sciencedirect.com/science/article/pii/S2078152015000516'
];

I thought it was probably a CORS issue. But after implementing a CORS script, responseText still contained only the partially downloaded content.

Any assistance will be appreciated.

Also, can someone tell me why, in Firefox, the script does not download the file and takes me to the URL instead?

Most likely because the page you are trying to download loads its content dynamically on scroll-down, like Facebook, so you only get to download the part that is loaded with the page. **edit:** I just opened the link to your article and that's exactly what it does. — Banana, Feb 15 '16 at 14:07
In addition to @Banana's answer: "Firefox only supports same-origin download links." This is the reason Firefox is not working. — Vitaliy Terziev, Feb 15 '16 at 14:12
If you view the source of the page, you can locate a hidden error message box which is supposed to pop up if you have JavaScript disabled. In that error box, you are offered a link to the full article which does not load its content dynamically: **http://www.sciencedirect.com/science/article/pii/S2078152015000516?np=y** — Banana, Feb 15 '16 at 14:18
@Banana JavaScript is enabled in my Chrome. I have also enabled pop-ups, but I could not see any pop-up or error message appear on the button click. However, when I opened the link you provided, I saw that it downloaded all the content. I can see that there is a hidden input field called targetURL which has this URL. I am still looking into how I can find these internal URLs without actually digging through the code, and what is stopping the whole content from loading in the first place. Could you assist with that as well? — user3050590, Feb 15 '16 at 18:43
@Banana Also, please post your comment as an answer to the problem, as I am now able to see the whole content. — user3050590, Feb 15 '16 at 18:44
@user3050590 You can only see the link if you have JavaScript disabled. That's the whole point of the message box... — Banana, Feb 15 '16 at 19:11

score 1 · Accepted Answer · edited May 23 '17 at 11:45

The reason you are unable to download the entire page is that the page only loads part of the article initially, and the rest is added dynamically as you scroll down.
Therefore, when you try to download the page, you only receive the initially loaded part, without the dynamically loaded content.

Since this is done with JavaScript, this particular website offers an alternative in case you have JavaScript disabled and cannot or do not want to enable it (for example, when using a screen reader).
If you view the source of the page, you can locate the following message box at the very beginning of the body:

<div class="ua_btn" role="region" aria-label="screen reader compatability">
    <a role="button" rel="nofollow" href="http://www.sciencedirect.com/science/article/pii/S2078152015000516?np=y">
        Screen reader users, click here to load entire article
    </a> 
    This page uses JavaScript to progressively load the article content as a user scrolls.
    Screen reader users, click the load entire article button to bypass dynamically loaded article content.
</div>

Here you are offered a link with the query string "np=y", which overrides the dynamic loading and loads the whole page right away:

http://www.sciencedirect.com/science/article/pii/S2078152015000516?np=y

Use this link to download the article and it will work.
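
As a minimal sketch, the "np=y" parameter could be added to each entry of a `links` array like the one in the question before downloading. The helper name `toFullArticleUrl` is made up for illustration, and the sketch assumes the standard `URL` API is available (modern browsers and Node.js). It also assumes other articles on the site accept the same parameter, which has only been checked here for the one article above.

```javascript
// Hypothetical helper (not part of the original script): returns the
// article URL with the "np=y" query parameter set, which on this page
// requests the fully loaded article instead of the progressive one.
function toFullArticleUrl(articleUrl) {
  var url = new URL(articleUrl);
  // set() adds the parameter, or overwrites it if it is already present,
  // so calling the helper twice does not duplicate "np=y".
  url.searchParams.set('np', 'y');
  return url.toString();
}

var links = [
  'http://www.sciencedirect.com/science/article/pii/S2078152015000516'
];

// Rewrite every link before handing it to the download code.
var fullLinks = links.map(toFullArticleUrl);
console.log(fullLinks);
```

The rewritten URLs can then be passed to whatever request or download code the script already uses.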

Firefox:
As mentioned in the comments, Firefox only supports same-origin download links; by design it does not download cross-origin links, due to potential security risks. More about it can be found here.
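
To illustrate the same-origin rule from the comments, here is a rough sketch of how a script could check whether a link shares the origin of the page it runs on before relying on a download link. The function name `isSameOrigin` and the `example.com` page URL are assumptions for illustration only; in a browser the page URL would typically come from `window.location.href`.

```javascript
// Hypothetical check: true when linkUrl has the same origin (scheme,
// host, and port) as pageUrl. Relative links are resolved against pageUrl.
function isSameOrigin(linkUrl, pageUrl) {
  var link = new URL(linkUrl, pageUrl);
  var page = new URL(pageUrl);
  return link.origin === page.origin;
}

var article = 'http://www.sciencedirect.com/science/article/pii/S2078152015000516';

// A script hosted elsewhere (made-up page URL) is cross-origin to the article,
// so a download link to it would fall under the restriction described above.
console.log(isSameOrigin(article, 'http://example.com/page.html'));

// A page on the article's own site would be same-origin.
console.log(isSameOrigin(article, 'http://www.sciencedirect.com/'));
```

When the check fails, the script would need a different approach, such as fetching the content through a same-origin server.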
