Using Artoo.js with Google Puppeteer for Web Scraping

Question

I can't seem to get Artoo.js working with Puppeteer.

I tried using it through npm install artoo-js, but it did not work.

I also tried injecting the built distribution file with the Puppeteer command page.injectFile(filePath), but had no luck.

Was anyone able to implement these two libraries successfully?

If so, I would love a code snippet of how Artoo.js was injected.

I don't have an exact answer to your question, but I wrote a piece on [Web scrapping with Puppeteer & Chrome Headless](https://medium.com/@e_mad_ehsan/getting-started-with-puppeteer-and-chrome-headless-for-web-scrapping-6bf5979dee3e) that might be helpful. — eMad, Aug 26 '17 at 05:52

score 4 · Accepted Answer · answered Aug 30 '17 at 16:39

I just tried Puppeteer for another answer, so I figured I could try Artoo too. Here you go :)

(Step 0: install Yarn if you don't have it)

yarn init
yarn add puppeteer
# Download the latest Artoo script; it is not added as a Yarn dependency because it won't be used by the Node.js runtime
wget https://medialab.github.io/artoo/public/dist/artoo-latest.min.js

Save this in index.js:

const puppeteer = require('puppeteer');
(async() => {
    const url = 'https://news.ycombinator.com/';
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Go to URL and wait for page to load
    await page.goto(url, {waitUntil: 'networkidle'});
    // Inject Artoo into page's JS context
    await page.injectFile('artoo-latest.min.js');
    // Sleeping 2s to let Artoo initialize (I don't have a more elegant solution right now)
    await new Promise(res => setTimeout(res, 2000));
    // Use Artoo from page's JS context
    const result = await page.evaluate(() => {
        return artoo.scrape('td.title:nth-child(3)', {
            title: {sel: 'a'},
            url: {sel: 'a', attr: 'href'}
        });
    });
    console.log(`Result has ${result.length} items, first one is:`, result[0]);
    await browser.close();
})();

Result:

$ node index.js 
Result has 30 items, first one is: { title: 'Headless mode in Firefoxdeveloper.mozilla.org',
url: 'https://developer.mozilla.org/en-US/Firefox/Headless_mode' }

(This is too funny to miss: right now the top article on Hacker News is about Firefox Headless...)
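
Note for newer Puppeteer versions: the script above uses the API as it was in August 2017. Later releases removed page.injectFile in favor of page.addScriptTag, and replaced the 'networkidle' option with 'networkidle0' and 'networkidle2'. Here is a rough sketch of the same flow with those calls. The function name `scrapeWithArtoo`, the `settleMs` parameter, and passing the `puppeteer` module in as an argument are my own choices for illustration, not something from either library.

```javascript
// Sketch only: same steps as the original script, using calls available in
// newer Puppeteer releases. `scrapeWithArtoo` and `settleMs` are hypothetical
// names; the puppeteer module is passed in so it can be swapped or stubbed.
async function scrapeWithArtoo(puppeteer, url, artooPath, settleMs = 2000) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // 'networkidle2': wait until there are at most 2 open connections for 500 ms
    await page.goto(url, { waitUntil: 'networkidle2' });
    // addScriptTag({ path }) loads a local file into the page's JS context
    await page.addScriptTag({ path: artooPath });
    // Same fixed pause as the original script, to let Artoo initialize
    await new Promise(resolve => setTimeout(resolve, settleMs));
    // Runs in the page, where the injected script has defined `artoo`
    return await page.evaluate(() =>
      artoo.scrape('td.title:nth-child(3)', {
        title: { sel: 'a' },
        url: { sel: 'a', attr: 'href' }
      })
    );
  } finally {
    // Close the browser even if one of the steps above throws
    await browser.close();
  }
}
```

You would call it as `scrapeWithArtoo(require('puppeteer'), 'https://news.ycombinator.com/', 'artoo-latest.min.js')` from inside an async function and log the result as before.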

Yeah, don't use Artoo's NPM packages for that. If I understand correctly, they are not suitable for web scraping (extracting data from the DOM in the browser's JS runtime); they are meant for extracting data from other XML documents in the Node.js runtime. The URL I used is the one they use in their bookmarklet. — Hugues M., Aug 31 '17 at 07:00
About waiting for Artoo to initialize, you can simply use: page.waitFor(2000) — Ernest, Dec 15 '17 at 10:55
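
On the waiting question: page.waitFor was deprecated and removed in later Puppeteer releases, and a fixed delay is a guess either way. One alternative is to poll the page with page.waitForFunction until Artoo looks usable. The sketch below assumes that "usable" means `artoo.scrape` exists and `artoo.$` (Artoo's jQuery reference) is a function; I have not checked this against Artoo's internals, so treat the readiness check as a placeholder. `isArtooReady` and `waitForArtoo` are hypothetical helper names.

```javascript
// Runs inside the page. ASSUMPTION: Artoo is ready once `artoo.scrape`
// exists and `artoo.$` (its jQuery reference) is a function.
function isArtooReady() {
  return Boolean(
    window.artoo &&
    typeof window.artoo.scrape === 'function' &&
    typeof window.artoo.$ === 'function'
  );
}

// Hypothetical helper, not part of Puppeteer or Artoo.
async function waitForArtoo(page, timeoutMs = 5000) {
  // waitForFunction re-evaluates the predicate in the page until it returns
  // a truthy value, and rejects if the timeout elapses first.
  await page.waitForFunction(isArtooReady, { timeout: timeoutMs });
}
```

In the answer's script, `await waitForArtoo(page);` would take the place of the 2 s sleep right after the injection step.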
