
Well, I would like a way to use Puppeteer and a for loop to get all the links on a site and add them to an array. In this case, the links I want are not links inside HTML tags; they are links that appear directly in the source code, such as JavaScript file links, etc. I want something like this:

const array = [];
for (const link of links) {
  array.push(link);
  // The code should take all the links and add them to the array
}

But how can I get all references to JavaScript and style files, and all URLs that are in the source code of a website? I only found a post and a question that teach or show how to get the links from the tags, not all the links from the source code.

Suppose you want to get all the script tags on this page, for example:

view-source:https://www.nike.com/

How can I get all script tags and print them to the console? I put view-source:https://nike.com because that way you can see the script tags. I don't know if you can do it without displaying the source code, but the idea I had was to display the source and get the script tags from it; however, I don't know how to do it.

  • Bounties are a way of using reputation to advertise questions, but be forewarned: you lose the rep immediately, with few chances to get it back. – Heretic Monkey Jun 01 '21 at 00:24
  • Stack Overflow isn't a code-writing service. Show us your own research first, please: what works and what issues you ran into. – Tschallacka Jun 03 '21 at 07:23
  • By site, do you mean one particular link (e.g. `google.com`) or all sublinks (e.g. `google.com` and `google.com/something` etc.) as well? – ulou Jun 03 '21 at 08:05
  • @Tschallacka I don't have any code; I couldn't find anything explaining this, so I asked on Stack Overflow to get an answer. I didn't find what I was looking for. –  Jun 03 '21 at 13:13
  • @ulou I want to get all links and sublinks, including links from CSS and JavaScript files, etc. I want to be able to get every link and sublink that is visible in the source code. –  Jun 03 '21 at 13:15
  • https://stackoverflow.com/questions/48864589/how-to-scrape-multi-level-links-using-puppeteer-js – ulou Jun 03 '21 at 13:16
  • @ulou I went to look at that answer and there was a problem: the site was down, and when I change the URL it returns an empty array. What would the script look like if I wanted to get all the links in the source code of google.com, for example? –  Jun 03 '21 at 14:22
  • @ulou I have already edited the question. –  Jun 04 '21 at 00:41

3 Answers


It is possible to get all links from a URL using only Node.js, without Puppeteer.

There are two main steps:

  1. Get the source code for the URL.
  2. Parse the source code for links.

A simple implementation in Node.js:

// get-links.js

///
/// Step 1: Request the URL's HTML source.
///

const axios = require('axios');
const promise = axios.get('https://www.nike.com');

// Extract the HTML source from the response, then process it:
promise.then(function(response) {
    const htmlSource = response.data;
    getLinksFromHtml(htmlSource);
});

///
/// Step 2: Find links in the HTML source.
///

// This function takes HTML (as a string) and outputs all the links within it.
function getLinksFromHtml(htmlString) {
    // Regular expression that matches the syntax of a link (https://stackoverflow.com/a/3809435/117030):
    const LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;

    // Use the regular expression above to find all the links:
    const matches = htmlString.match(LINK_REGEX);

    // Output to console:
    console.log(matches);

    // Alternatively, return the array of links for further processing:
    return matches;
}

Sample usage:

$ node get-links.js
[
    'http://www.w3.org/2000/svg',
    ...
    'https://s3.nikecdn.com/unite/scripts/unite.min.js',
    'https://www.nike.com/android-icon-192x192.png',
    ...
    'https://connect.facebook.net/',
... 658 more items
]

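One note: the regular expression above only matches absolute http(s):// URLs, so relative references such as /assets/app.js in the source will not be captured. A minimal sketch of one way to also collect relative src and href values (the regex and helper name here are illustrative additions, not from the answer above):

// get-relative-links.js: a companion sketch to get-links.js above.
const axios = require('axios');

// Matches the value of any src="..." or href="..." attribute,
// which also catches relative paths the absolute-URL regex misses:
const SRC_HREF_REGEX = /(?:src|href)=["']([^"']+)["']/gi;

function getRelativeLinksFromHtml(htmlString) {
    const links = [];
    let match;
    while ((match = SRC_HREF_REGEX.exec(htmlString)) !== null) {
        links.push(match[1]); // capture group 1 holds the URL itself
    }
    return links;
}

axios.get('https://www.nike.com').then(function(response) {
    console.log(getRelativeLinksFromHtml(response.data));
});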

Leftium

Although the other answers are applicable in many situations, they will not work for client-side rendered sites. For instance, if you just make an Axios request to Reddit, all you'll get is a couple of divs with some metadata. Because Puppeteer actually loads the page and runs all its JavaScript in a real browser, a website's choice of document rendering becomes irrelevant for extracting page data.

Puppeteer has an evaluate method on the page object which allows you to run JavaScript directly on the page. Using that, you can easily extract all links as follows:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  const pageUrls = await page.evaluate(() => {
    // document.links holds every <a> and <area> element with an href:
    const urlArray = Array.from(document.links).map((link) => link.href);
    // Deduplicate with a Set:
    const uniqueUrlArray = [...new Set(urlArray)];
    return uniqueUrlArray;
  });

  console.log(pageUrls);

  await browser.close();
})();
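Note that document.links only covers <a> and <area> elements with an href. If you also want the JavaScript and stylesheet URLs the question asks about, the same evaluate call can be extended; a sketch (the extra selectors are my assumption about what you need, not part of the answer above):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  const allUrls = await page.evaluate(() => {
    // Anchor/area links, plus external script and stylesheet references:
    const anchors = Array.from(document.links).map((el) => el.href);
    const scripts = Array.from(document.querySelectorAll('script[src]')).map((el) => el.src);
    const styles = Array.from(document.querySelectorAll('link[href]')).map((el) => el.href);
    return [...new Set([...anchors, ...scripts, ...styles])];
  });

  console.log(allUrls);

  await browser.close();
})();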
Dan Pavlov

Yes, you can get all the script tags and their links without opening view-source. You need to add the jsdom library as a dependency in your project and then pass the HTML response to its instance, as below.

Here is the code:

const axios = require('axios');
const jsdom = require("jsdom");

(async () => {
    // Make a simple HTTP request using axios (or node-fetch, as you wish):
    const nikePageResponse = await axios.get('https://www.nike.com');

    // Now parse this response into an HTML document using the jsdom library.
    // Passing the page URL lets jsdom resolve relative src values to absolute ones:
    const dom = new jsdom.JSDOM(nikePageResponse.data, { url: 'https://www.nike.com' });
    const nikePage = dom.window.document;

    // Now get all the script tags by querying this page:
    const scriptLinks = [];
    nikePage.querySelectorAll('script[src]').forEach(script => scriptLinks.push(script.src.trim()));
    console.debug('%o', scriptLinks);
})();

Here I have used a CSS selector for <script> tags that have a src attribute.
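The same pattern works for stylesheets. A minimal sketch, assuming the nikePage document from the snippet above (the selector choice is mine):

// Same idea for stylesheets: <link> tags whose rel is "stylesheet".
const styleLinks = [];
nikePage.querySelectorAll('link[rel="stylesheet"]').forEach(link => styleLinks.push(link.href.trim()));
console.debug('%o', styleLinks);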

You can write the same code using Puppeteer, but it will take more time to open the browser, load everything, and then get the page source.

You can use this to find the links and then do whatever you want with them, using Puppeteer or anything else.

dangi13