I am looking for a command line option to get a webpage, and execute the associated JavaScript code. In other words, call a headless browser via command line.

I can't use wget because it does not load and execute the associated JavaScript:

wget --load-cookies cookies.txt -O /dev/null https://example.com/update?run=1

Use case: we have web pages that read Elasticsearch indexes, do some data manipulation, and update the Elasticsearch indexes. We'd like to run the update hourly via a cron job. We don't need to capture anything (no PNG capture, no HTML capture). We simply need to load the web page and execute its JavaScript from a cron job, ideally with something like run-headless https://example.com/update. The OS is CentOS 7.

I searched Stack Overflow and did not find an answer that meets my needs. Selenium and similar tools seem like overkill.

Peter Thoeny
  • Sorry, but StackOverflow is dedicated to helping solve programming code problems. Requests for recommendations of tools are explicitly off-topic. Your Q **may be** more appropriate for [softwarerecs.se] , but read their help section regarding on-topic questions . AND please read [Help On-topic](https://stackoverflow.com/Help/On-topic) and [Help How-to-ask](https://stackoverflow.com/Help/How-to-ask) before posting more Qs here. Good luck. – shellter Nov 26 '20 at 04:23
  • I read the rules, and I updated the question with what I tried, and with related Stack Overflow questions that have answers (none helpful for my use case) – Peter Thoeny Nov 26 '20 at 04:40
  • Also, IRT on-topic, "software tools commonly used by programmers; and is a practical, answerable problem that is unique to software development". Am I wrong in assuming `wget`, `selenium`, etc are? – Peter Thoeny Nov 26 '20 at 04:48
  • If others are happy to answer your question, then I am happy for you. Your edits don't really help, as you should only use links to cite them as resources, and the body of your Q should include the relevant ideas from those links. Readers shouldn't have to look at numerous other pages to understand your coding question. Just MHO, so again, will be happy for you if you get an answer. Good luck. – shellter Nov 26 '20 at 04:51

1 Answer

After some research I found a solution that uses Puppeteer to drive headless Chrome. Ideally I wanted a single command like run-headless https://example.com/update, but the pages require a login, so the headless browser is driven by a Puppeteer script.

Installation steps for CentOS 7.6:

1. Install Chrome

# cd
# mkdir install
# cd install/
# wget http://mirror.centos.org/centos/7/os/x86_64/Packages/vulkan-filesystem-1.1.97.0-1.el7.noarch.rpm
# yum localinstall vulkan-filesystem-1.1.97.0-1.el7.noarch.rpm
# wget http://mirror.centos.org/centos/7/os/x86_64/Packages/vulkan-1.1.97.0-1.el7.x86_64.rpm
# yum localinstall vulkan-1.1.97.0-1.el7.x86_64.rpm
# wget http://mirror.centos.org/centos/7/os/x86_64/Packages/liberation-fonts-1.07.2-16.el7.noarch.rpm
# yum localinstall liberation-fonts-1.07.2-16.el7.noarch.rpm
# vi /etc/yum.repos.d/google-chrome.repo
# cat /etc/yum.repos.d/google-chrome.repo
[google-chrome]
name=google-chrome
baseurl=http://dl.google.com/linux/chrome/rpm/stable/x86_64
enabled=1
gpgcheck=1
gpgkey=https://dl.google.com/linux/linux_signing_key.pub
# yum install google-chrome-stable

2. Install Node.js

# curl -sL https://rpm.nodesource.com/setup_14.x | sudo bash -
# yum install nodejs

3. Patch /etc/sysctl.conf

This was needed to run puppeteer without disabling the sandbox:

# echo "user.max_user_namespaces=15000" >> /etc/sysctl.conf
# reboot
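
Whether the setting took effect after the reboot can be checked from Node before Chrome is launched. The following is only a sketch of such a preflight check and not part of the original setup: the helper names `userNamespacesEnabled` and `readNamespaceLimit` are hypothetical, and it assumes the kernel exposes the sysctl as /proc/sys/user/max_user_namespaces.

```javascript
const fs = require('fs');

// Where Linux exposes the user.max_user_namespaces sysctl (assumed present).
const SYSCTL_PATH = '/proc/sys/user/max_user_namespaces';

// Hypothetical helper: true if the raw sysctl text is a positive integer.
function userNamespacesEnabled(rawValue) {
    const limit = Number.parseInt(String(rawValue).trim(), 10);
    return Number.isInteger(limit) && limit > 0;
}

// Hypothetical helper: read the live value, or return null if it is unreadable
// (for example on a kernel without this sysctl).
function readNamespaceLimit(path = SYSCTL_PATH) {
    try {
        return fs.readFileSync(path, 'utf8');
    } catch (err) {
        return null;
    }
}

const raw = readNamespaceLimit();
if (raw === null) {
    console.log('could not read ' + SYSCTL_PATH);
} else {
    const state = userNamespacesEnabled(raw) ? 'enabled' : 'disabled';
    console.log('user namespaces ' + state + ' (limit ' + raw.trim() + ')');
}
```

A limit of 0 means unprivileged user namespaces are off, which is the situation the sysctl.conf change is meant to fix.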

4. Create run-hourly.js puppeteer script

This node script has to run as a regular user, not root:

$ cd /path/to/script
$ npm install --save puppeteer
$ npm install --save pending-xhr-puppeteer
$ mkdir userDataDir
$ vi run-hourly.js # (content below)
$ node run-hourly.js

File content of run-hourly.js script:

const config = {
    userDataDir: __dirname + '/userDataDir',
    login: {
        url:        'https://www.example.com/login/',
        username:   'foobar',
        password:   'secret',
    },
    pages: [{
        url:        'https://www.example.com/update/hourly',
        pdfFile:    __dirname + '/page.pdf'
    }]
};

const puppeteer = require('puppeteer');
const { PendingXHR } = require('pending-xhr-puppeteer');

(async () => {
    // initialize headless browser
    const browser = await puppeteer.launch({
        headless:       true,   // run headless
        dumpio:         true,   // pipe browser stdout/stderr to this process
        userDataDir:    config.userDataDir // custom user data
    });
    try {
        const page = await browser.newPage();
        const pendingXHR = new PendingXHR(page);

        // login; start waiting for the navigation before clicking to avoid a race
        await page.goto(config.login.url, {waitUntil: 'load'});
        await page.type('#loginusername', config.login.username);
        await page.type('#password', config.login.password);
        await Promise.all([
            page.waitForNavigation(),
            page.click('#signin')
        ]);

        // load pages of interest one at a time, since they share a single tab
        for (const pageCfg of config.pages) {
            await page.goto(pageCfg.url, {waitUntil: 'networkidle0'}); // wait for page load
            await pendingXHR.waitForAllXhrFinished(); // wait for all XHR requests to finish
            await page.pdf({path: pageCfg.pdfFile});  // generate PDF from rendered page
        }
    } finally {
        await browser.close(); // close the browser even if a step failed
    }
})().catch((err) => {
    console.error(err);
    process.exitCode = 1; // non-zero exit status on failure
});
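
The script above keeps the login password in the source file. One alternative, sketched here under assumptions, is to read the credentials from environment variables. The variable names `UPDATE_USER`, `UPDATE_PASS` and `UPDATE_LOGIN_URL` and the helper `loadLogin` are hypothetical; nothing in the setup above defines them.

```javascript
// Hypothetical helper: build the `login` part of the config from environment
// variables instead of hardcoding it. All variable names are assumptions.
function loadLogin(env) {
    const missing = ['UPDATE_USER', 'UPDATE_PASS'].filter((name) => !env[name]);
    if (missing.length > 0) {
        throw new Error('missing environment variable(s): ' + missing.join(', '));
    }
    return {
        url:      env.UPDATE_LOGIN_URL || 'https://www.example.com/login/',
        username: env.UPDATE_USER,
        password: env.UPDATE_PASS,
    };
}

// Example with an explicit object; a real run would pass process.env.
const login = loadLogin({ UPDATE_USER: 'foobar', UPDATE_PASS: 'secret' });
console.log('logging in as ' + login.username);
```

In run-hourly.js this would replace the literal `login` object, for example `login: loadLogin(process.env)`. Cron starts jobs with a minimal environment, so the variables would have to be set in the crontab or in a wrapper script.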

5. Add hourly job to cron

Install the cron job as the same user that owns the script:

$ crontab -l
$ crontab -e
25 * * * * cd /path/to/script && node run-hourly.js > hourly.log 2>&1
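
For pages that need no login, the single command the question asks for could look roughly like the sketch below. It is an illustration rather than a tested tool: the file name run-headless.js and the helpers `parseUrlArg`, `runHeadless` and `main` are hypothetical, and it assumes puppeteer is installed as in step 4.

```javascript
// run-headless.js (hypothetical): load one URL in headless Chrome,
// let the page's JavaScript run, then exit. Assumes no login is required.

// Validate the command line argument and return a normalized http(s) URL.
function parseUrlArg(raw) {
    const url = new URL(raw); // throws on a malformed URL
    if (url.protocol !== 'http:' && url.protocol !== 'https:') {
        throw new Error('unsupported protocol: ' + url.protocol);
    }
    return url.href;
}

async function runHeadless(url) {
    const puppeteer = require('puppeteer'); // loaded lazily, only when a page is fetched
    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        // networkidle0: resolves once there have been no network connections for 500 ms
        await page.goto(url, { waitUntil: 'networkidle0' });
    } finally {
        await browser.close();
    }
}

async function main(argv) {
    if (argv.length < 3) {
        console.error('usage: node run-headless.js <url>');
        return;
    }
    await runHeadless(parseUrlArg(argv[2]));
}

main(process.argv).catch((err) => {
    console.error(err.message);
    process.exitCode = 1; // lets cron or a wrapper detect the failure
});
```

It would be called as `node run-headless.js https://example.com/update`, either by hand or from the crontab line in place of run-hourly.js.
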
Peter Thoeny