0

I'm trying to scrape a website using Node.js, but the HTML I get back contains mostly script tags that inject content with JavaScript. When I reviewed the JavaScript file in question, I found the text I was trying to scrape, which confirmed this. How can I scrape the document after the script has injected its content into the HTML? Is there a way? Thanks

John

2 Answers

1

I think you need to use a headless browser, which evaluates JavaScript like a normal web browser does. Then, after the page loads, you can run your own JavaScript on it, as you would in the Chrome console window (for example), or access its HTML elements.

For Node.js there is Puppeteer, which I have used several times to scrape data from SPA web apps.
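As a rough sketch of what that could look like (assuming Puppeteer has been installed with `npm install puppeteer`; the function name `scrapeRenderedHtml` and its `url` and `selector` parameters are made up for illustration), something along these lines should return the HTML after the scripts have run:

```javascript
// Minimal sketch, not a definitive implementation.
// `scrapeRenderedHtml`, `url`, and `selector` are hypothetical names.
async function scrapeRenderedHtml(url, selector) {
  // Loaded lazily so the file can be required without starting a browser.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity has settled so injected content is present.
    await page.goto(url, { waitUntil: 'networkidle0' });
    // Optionally wait for a specific element that the script injects.
    if (selector) {
      await page.waitForSelector(selector);
    }
    // Serialized HTML of the page as it is now, after the scripts have run.
    return await page.content();
  } finally {
    // Always close the browser, even if navigation or waiting fails.
    await browser.close();
  }
}
```

You would call it as `scrapeRenderedHtml('https://hostname.com/', '#some-element').then(console.log)`, where both the URL and the selector are placeholders for your own page.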

Maksymilian Tomczyk
  • What do you mean by very slow? Is it much slower than a normal web browser? – Maksymilian Tomczyk Nov 03 '20 at 20:21
  • It's a bit slower than a browser. My main issue with the slowness is that I'm not scraping the page just once: I scrape a page, and if I see something I go to another page and scrape that, over and over until I get my result. The problem is that it seems to leak memory even though I'm calling `page.close()` after every scrape – John Nov 03 '20 at 20:28
  • You can try PhantomJS. The concept of the library is the same, but PhantomJS is currently unmaintained. https://phantomjs.org/ – Maksymilian Tomczyk Nov 03 '20 at 20:31
  • I've heard of PhantomJS but wasn't sure about it because it has been discontinued and because I'd have to download it; it's not just an NPM library. Do you know why I might be getting a memory leak with Puppeteer? – John Nov 03 '20 at 20:32
  • The memory leak may be caused by Puppeteer itself, as there is an open issue about it on its GitHub: https://github.com/puppeteer/puppeteer/issues/5893 – Maksymilian Tomczyk Nov 03 '20 at 20:46
  • A current workaround appears to be closing the browser session and re-opening it every time I scrape – John Nov 03 '20 at 21:13
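
For reference, the workaround from the last comment (a fresh browser for every page instead of only `page.close()`) might be sketched like this. It is only an illustration: `scrapeChain`, `launch`, `extractNext`, and `maxPages` are hypothetical names, and `launch` is passed in so that in real use it would be `() => require('puppeteer').launch()`.

```javascript
// Sketch of the "close and re-open the browser per scrape" workaround.
// `launch` is a hypothetical injected function returning a browser object;
// `extractNext` is a hypothetical callback that returns the next URL or null.
async function scrapeChain(startUrl, launch, extractNext, maxPages = 10) {
  const visited = [];
  let url = startUrl;
  while (url && visited.length < maxPages) {
    const browser = await launch(); // fresh browser for every page
    try {
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle0' });
      const html = await page.content();
      visited.push(url);
      url = extractNext(html); // decide where to go next, or stop on null
    } finally {
      // Closing the whole browser ends its process and frees its memory.
      await browser.close();
    }
  }
  return visited;
}
```

Launching a browser for every page is slower than reusing one, so this trades speed for bounded memory use.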
0

I am assuming your goal is to scrape the source of the script tags. <script src="scrape-this-file.js"></script>

I don't have enough details to help you properly, but you could scrape the links off the page by searching for any string that starts with src=" and ends with ">. Appending these filenames to a separate array and then scraping them separately would probably be the best solution. You can do this by pointing your scraper at https://hostname.com/scrape-this-file.js (or another specified directory).
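
A minimal sketch of that idea, assuming the page's HTML is already available as a string (the function name `extractScriptUrls` is made up, and a regular expression is only a rough stand-in for a real HTML parser):

```javascript
// Collects the src of every <script src="..."> tag and resolves it
// against the page URL. `extractScriptUrls` is a hypothetical helper.
function extractScriptUrls(html, baseUrl) {
  const urls = [];
  // Matches <script ... src="..."> with single or double quotes.
  const re = /<script\b[^>]*\bsrc\s*=\s*["']([^"']+)["']/gi;
  for (const match of html.matchAll(re)) {
    // new URL() turns relative paths into absolute URLs.
    urls.push(new URL(match[1], baseUrl).href);
  }
  return urls;
}

const html = '<script src="scrape-this-file.js"></script><script>inline()</script>';
console.log(extractScriptUrls(html, 'https://hostname.com/'));
// prints [ 'https://hostname.com/scrape-this-file.js' ]
```

Each URL in the resulting array can then be fetched and scraped separately.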

Apologies if I did not answer your question. I have had to make a lot of assumptions about the issue, as I don't 100% understand what you're scraping for.

Cameron
  • I appreciate the response, but what I want to do is scrape the normal webpage after the JS is injected into it, not the JS file :/ – John Nov 03 '20 at 19:07