Possible to run Headless Chrome/Chromium in a Google Cloud Function?

Question

Is there any way to run Headless Chrome/Chromium in a Google Cloud Function? I understand I can include and run statically compiled binaries in GCF. Can I get a statically compiled version of Chrome that would work for this?

Some are working on it https://github.com/adieuadieu/serverless-chrome — Palani, May 11 '17 at 04:18

score 16 · Answer 1 · edited Aug 06 '18 at 15:35

The Node.js 8 runtime for Google Cloud Functions now includes all the necessary OS packages to run Headless Chrome.

Here is a code sample of an HTTP function that returns screenshots:

Main index.js file:

const puppeteer = require('puppeteer');

exports.screenshot = async (req, res) => {
  const url = req.query.url;

  if (!url) {
    return res.send('Please provide URL as GET parameter, for example: <a href="?url=https://example.com">?url=https://example.com</a>');
  }

  const browser = await puppeteer.launch({
    args: ['--no-sandbox']
  });
  const page = await browser.newPage();
  await page.goto(url);
  const imageBuffer = await page.screenshot();
  await browser.close();

  res.set('Content-Type', 'image/png');
  res.send(imageBuffer);
}

and package.json

{
  "name": "screenshot",
  "version": "0.0.1",
  "dependencies": {
    "puppeteer": "^1.6.2"
  }
}
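
The sample above passes `req.query.url` straight to `page.goto`. A minimal sketch of validating it first, assuming only `http:` and `https:` targets should be allowed; `parseTargetUrl` is a hypothetical helper name, not part of Puppeteer or Cloud Functions:

```javascript
const { URL } = require('url'); // WHATWG URL class from the Node.js standard library

// Hypothetical helper: returns a normalized URL string, or null if the input
// is missing, malformed, or uses a scheme other than http/https.
function parseTargetUrl(raw) {
  if (typeof raw !== 'string' || raw.length === 0) {
    return null;
  }
  let parsed;
  try {
    parsed = new URL(raw);
  } catch (error) {
    return null; // not an absolute URL
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return null; // rejects file:, data:, and other schemes
  }
  return parsed.href;
}
```

In the function above, this could stand in for the `if (!url)` check: call `parseTargetUrl(req.query.url)` and send an error response when it returns null.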

answered Aug 01 '18 at 03:58 by Steren · edited Aug 06 '18 at 15:35 by ebidel

Would puppeteer be a dependency that is cached by Google so it's optimized, or is it something that would require more resources (i.e. memory, CPU) to use? – jasan Aug 08 '18 at 20:53
The function does not get deployed on my Firebase account. Other (non-async) functions do. I have npm 8 enabled – daniel Aug 13 '18 at 14:20
Any chance of getting support for the Java runtime so this can work with Selenium? – kashiB Sep 09 '18 at 05:17
@ebidel, does the Python env have the equivalent packages to run Headless chrome? – FKrauss Mar 26 '19 at 13:51

score 6 · Answer 2 · answered Mar 27 '18 at 20:56

I've just deployed a GCF function running headless Chrome. A couple of takeaways:

1. you have to statically compile Chromium and NSS on Debian 8
2. you have to patch environment variables to point to NSS before launching Chromium
3. performance is much worse than what you'd get on AWS Lambda (3+ seconds)

For 1, you should be able to find plenty of instructions online.

For 2, the code that I'm using is the following:

// Static class method; assumes `fs` and `path` are already required in this module.
static executablePath() {
  let bin = path.join(__dirname, '..', 'bin', 'chromium');
  let nss = path.join(__dirname, '..', 'bin', 'nss', 'Linux3.16_x86_64_cc_glibc_PTH_64_OPT.OBJ');

  if (process.env.PATH === undefined) {
    process.env.PATH = path.join(nss, 'bin');
  } else if (process.env.PATH.indexOf(nss) === -1) {
    process.env.PATH = [path.join(nss, 'bin'), process.env.PATH].join(':');
  }

  if (process.env.LD_LIBRARY_PATH === undefined) {
    process.env.LD_LIBRARY_PATH = path.join(nss, 'lib');
  } else if (process.env.LD_LIBRARY_PATH.indexOf(nss) === -1) {
    process.env.LD_LIBRARY_PATH = [path.join(nss, 'lib'), process.env.LD_LIBRARY_PATH].join(':');
  }

  if (fs.existsSync('/tmp/chromium') === true) {
    return '/tmp/chromium';
  }

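  return new Promise((resolve, reject) => {
    // fs.chmod is asynchronous, so a try/catch around the call cannot see
    // errors raised later inside its callback. The chmod result is ignored,
    // as in the original; only a failed symlink rejects the promise.
    fs.chmod(bin, '0755', () => {
      try {
        fs.symlinkSync(bin, '/tmp/chromium');
      } catch (error) {
        return reject(error);
      }
      return resolve('/tmp/chromium');
    });
  });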
}
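
The two environment-variable blocks above follow the same prepend-if-missing pattern. A minimal sketch of that pattern as one function, assuming the Linux `:` separator; `prependPathEntry` is a hypothetical name, and it compares whole entries rather than doing the substring search that `indexOf` performs in the original:

```javascript
// Hypothetical helper: returns `current` with `dir` placed first, unless `dir`
// is already one of its colon-separated entries.
function prependPathEntry(current, dir) {
  if (current === undefined || current === '') {
    return dir;
  }
  const entries = current.split(':');
  if (entries.includes(dir)) {
    return current; // already present, leave the value untouched
  }
  return [dir, ...entries].join(':');
}

// Possible use inside executablePath(), with `nss` and `path` as defined there:
// process.env.PATH = prependPathEntry(process.env.PATH, path.join(nss, 'bin'));
// process.env.LD_LIBRARY_PATH = prependPathEntry(process.env.LD_LIBRARY_PATH, path.join(nss, 'lib'));
```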

You also need to use a few required arguments when starting Chrome, namely:

--disable-dev-shm-usage
--disable-setuid-sandbox
--no-first-run
--no-sandbox
--no-zygote
--single-process
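
A minimal sketch of how these flags might be passed along, assuming Chromium is started through Puppeteer (this answer does not say which launcher it uses). `buildLaunchOptions` and `REQUIRED_CHROME_ARGS` are hypothetical names; `executablePath` and `args` are existing `puppeteer.launch()` options:

```javascript
// Flags listed in the answer above.
const REQUIRED_CHROME_ARGS = [
  '--disable-dev-shm-usage',
  '--disable-setuid-sandbox',
  '--no-first-run',
  '--no-sandbox',
  '--no-zygote',
  '--single-process',
];

// Hypothetical helper: builds the options object for puppeteer.launch().
function buildLaunchOptions(executablePath) {
  return {
    executablePath,
    args: REQUIRED_CHROME_ARGS.slice(), // copy so callers cannot mutate the shared list
  };
}
```

With Puppeteer installed, the result could be handed to `puppeteer.launch(...)` together with the path returned by `executablePath()` above.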

I hope this helps.

score 0 · Answer 3 · answered May 31 '17 at 18:12

As mentioned in the comment, work is being done on a possible solution to running a headless browser in a cloud function. A directly applicable discussion, "headless chrome & aws lambda", can be read on Google Groups.

answered May 31 '17 at 18:12 by George

score -1 · Answer 4 · answered Aug 13 '18 at 18:35

The question I had was whether you can run headless Chrome or Chromium in Firebase Cloud Functions. In my experience the answer is no: the Node.js project will not have access to any Chrome/Chromium executables and will therefore fail (trust me, I've tried).

A better solution is to use the phantom npm package, which uses PhantomJS under the hood: https://www.npmjs.com/package/phantom

Docs and info can be found here:

http://amirraminfar.com/phantomjs-node/#/

or

https://github.com/amir20/phantomjs-node

The site I was trying to crawl had implemented software against screen scraping. The trick is to wait for the page to load by searching for an expected string or a regex match. If you need a regex of any complexity made for you, get in touch at https://AppLogics.uk/ (starting at £5 GBP).

Here is a TypeScript snippet to make the HTTP or HTTPS call:

        const phantom = require('phantom');
        const instance: any = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
        const page: any = await instance.createPage();
        const status = await page.open('https://somewebsite.co.uk/');
        const content = await page.property('content');

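The same again in JavaScript (compiled output, so `yield` stands in for `await` and the code must run inside a generator function):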

        const phantom = require('phantom');
        const instance = yield phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
        const page = yield instance.createPage();
        const status = yield page.open('https://somewebsite.co.uk/');
        const content = yield page.property('content');

That's the easy bit! If it's a static page you're pretty much done, and you can parse the HTML with something like the cheerio npm package: https://github.com/cheeriojs/cheerio (an implementation of core jQuery designed for servers).

However, if it is a dynamically loading page (i.e. lazy loading, or even anti-scraping methods), you will need to wait for the page to update by looping, calling the page.property('content') method, and running a text search or regex to see if your page has finished loading.

I have created a generic asynchronous function that returns the page content (as a string) on success and throws an exception on failure or timeout. Its parameters are page, text (the string to search for that indicates success), error (a string that indicates failure, or null to skip the error check), and timeout (a number of milliseconds):

TypeScript:

    async function waitForPageToLoadStr(page: any, text: string, error: string, timeout: number): Promise<string> {
        const maxTime = timeout ? (new Date()).getTime() + timeout : null;
        let html: string = '';
        html = await page.property('content');
        async function loop(): Promise<string>{
            async function checkSuccess(): Promise <boolean> {
                html = await page.property('content');
                if (error != null && html.includes(error)) { // `!= null` covers null and undefined without the unimported util helper
                    throw new Error(`Error string found: ${ error }`);
                }
                if (maxTime && (new Date()).getTime() >= maxTime) {
                    throw new Error(`Timed out waiting for string: ${ text }`);
                }
                return html.includes(text)
            }
            if (await checkSuccess()){
                return html;
            } else {
                return loop();
            }                
        }
        return await loop();
    }

JavaScript:

    function waitForPageToLoadStr(page, text, error, timeout) {
            return __awaiter(this, void 0, void 0, function* () {
                const maxTime = timeout ? (new Date()).getTime() + timeout : null;
                let html = '';
                html = yield page.property('content');
                function loop() {
                    return __awaiter(this, void 0, void 0, function* () {
                        function checkSuccess() {
                            return __awaiter(this, void 0, void 0, function* () {
                                html = yield page.property('content');
                                if (error != null && html.includes(error)) { // `!= null` covers null and undefined without the unimported util helper
                                    throw new Error(`Error string found: ${error}`);
                                }
                                if (maxTime && (new Date()).getTime() >= maxTime) {
                                    throw new Error(`Timed out waiting for string: ${text}`);
                                }
                                return html.includes(text);
                            });
                        }
                        if (yield checkSuccess()) {
                            return html;
                        }
                        else {
                            return loop();
                        }
                    });
                }
                return yield loop();
            });
        }

I have personally used this function like this:

TypeScript:

    try {
        const phantom = require('phantom');
        const instance: any = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
        const page: any = await instance.createPage();
        const status = await page.open('https://somewebsite.co.uk/');
        await waitForPageToLoadStr(page, '<div>Welcome to somewebsite</div>', '<h1>Website under maintenance, try again later</h1>', 1000);
    } catch (error) {
        console.error(error);
    }

JavaScript:

    try {
        const phantom = require('phantom');
        const instance = yield phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
        const page = yield instance.createPage();
        yield page.open('https://vehicleenquiry.service.gov.uk/');
        yield waitForPageToLoadStr(page, '<div>Welcome to somewebsite</div>', '<h1>Website under maintenance, try again later</h1>', 1000);
    } catch (error) {
        console.error(error);
    }
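
The `waitForPageToLoadStr` function above re-queries the page as fast as the promises resolve. A minimal sketch of the same idea with a pause between polls, written with plain `async`/`await`; `pollPageContent`, `sleep`, and the `intervalMs` parameter are hypothetical names, and unlike the original it checks for the success string before checking the deadline:

```javascript
// Resolves after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical variant of waitForPageToLoadStr that pauses between polls.
// `page` only needs a property('content') method returning a promise of a string.
async function pollPageContent(page, text, error, timeout, intervalMs = 100) {
  const deadline = timeout ? Date.now() + timeout : null;
  for (;;) {
    const html = await page.property('content');
    if (error != null && html.includes(error)) {
      throw new Error(`Error string found: ${error}`);
    }
    if (html.includes(text)) {
      return html;
    }
    if (deadline !== null && Date.now() >= deadline) {
      throw new Error(`Timed out waiting for string: ${text}`);
    }
    await sleep(intervalMs); // give the page time to update before asking again
  }
}
```

It takes the same first four arguments as `waitForPageToLoadStr`, so it could be swapped into the usage snippets above; the interval defaults to 100 ms here purely as an illustrative value.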

Happy crawling!
