1

I'm scraping a website and I'm using Cheerio and Puppeteer. I need to click a certain button with a given text. Here is my code:

    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.website.com', {waitUntil: 'networkidle0'});
    const html = await page.content();

    const $ = cheerio.load(html);
    
    const items = [];
    $('.grid-table-container').each((index, element) => {
        items.push({
            element: $($('.grid-option-name', element)[0]).contents().not($('.grid-option-name', element).children()).text().trim(),
            button: $('.grid-option-selectable>div', element)
        });
    });


    for (const item of items) {
        if (item.element === 'Foo Bar') {
            // This is the call that fails: item.button is a Cheerio object, not a selector string
            await page.click(item.button);
        }
    }

Here is the markup I'm trying to scrape:

<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table">
    <div class="grid-item">
        <div class="grid-item-container">
            <div class="grid-table-container">
                <div class="grid-option-header">
                    <div class="grid-option-caption">
                        <div class="grid-option-name">
                            Foo Bar
                            <span>some other text</span>
                        </div>
                    </div>
                </div>
                <div class="grid-option-table">
                    <div class="grid-option">
                        <div class="grid-option-selectable">
                            <div></div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div>
</div>
<div class="item-table"></div>
<div class="item-table"></div>

Clicking on a Cheerio element doesn't work. Is there any way to do this?

smartmouse
  • Is `item.button` a string? Isn't it some object? `page.click()` takes a string as the first argument (https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pageclickselector-options) – pavelsaman Nov 20 '20 at 11:32
  • I read documentation and yes, it isn't a string. It is for this reason that I'm asking if there is a way to do what I need to do. For example some way to get string from Cheerio element, any other method of Puppeteer that can handle Cheerio objects and so on... – smartmouse Nov 20 '20 at 11:37
  • `await item.button.click()` doesn't work? – pavelsaman Nov 20 '20 at 11:46
  • No, because Cheerio is not a web browser, as stated here: https://stackoverflow.com/questions/56675374/how-to-fix-click-is-not-a-function-in-node-cheerio – smartmouse Nov 20 '20 at 11:52

2 Answers

0

You could add jquery to the page and do it there:

await page.addScriptTag({path: "jquery.js"})
await page.evaluate(() => {
  // do jquery stuff here  
})
pguardiario
0

There's no way to do this. Puppeteer is a totally different API from Cheerio. The two don't talk to each other or interoperate at all. The only thing you can do is snapshot an HTML string in Puppeteer and pass it to Cheerio.

Puppeteer works in the browser context on the live website, with native XPath and CSS capabilities--basically, all the power of the browser at your disposal.

On the other hand, Cheerio is a Node-based HTML parser that simulates a tiny portion of the browser environment. It offers a small subset of Puppeteer's functionality, so don't use Cheerio and Puppeteer together under most circumstances.

Taking a snapshot of the live site, then re-parsing the string into a tree Cheerio can work with is confusing, inefficient and offers few obvious advantages over using the actual thing that's right in front of you. It's like buying a bike just to carry it around.

The solution is to stick with Puppeteer ElementHandle objects:

const puppeteer = require("puppeteer"); // ^19.0.0

const html = `
<div class="item-table">
  <div class="grid-item">
    <div class="grid-item-container">
      <div class="grid-table-container">
        <div class="grid-option-header">
          <div class="grid-option-caption">
            <div class="grid-option-name">
              Foo Bar
              <span>some other text</span>
            </div>
          </div>
        </div>
        <div class="grid-option-table">
          <div class="grid-option">
            <div class="grid-option-selectable">
              <div></div>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</div>
<script>
// for testing purposes
const el = document.querySelector(".grid-option-selectable > div");
el.addEventListener("click", e => e.target.textContent = "clicked");
el.style.height = el.style.width = "50px";
</script>
`;

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(html);

  for (const el of await page.$$(".grid-item-container")) {
    const text = await el.$eval(
      ".grid-option-name",
      el => el.childNodes[0].textContent
    );
    const sel = ".grid-option-selectable > div";

    if (text.trim() === "Foo Bar") {
      const selectable = await el.$(sel);
      await selectable.click();
    }

    console.log(await el.$eval(sel, el => el.textContent)); // => clicked
  }
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Or perform your click in the browser:

await page.$$eval(".grid-item-container", els => els.forEach(el => {
  const text = el.querySelector(".grid-option-name")
    .childNodes[0].textContent.trim();

  if (text === "Foo Bar") {
    // Query within this container, not the whole document
    el.querySelector(".grid-option-selectable > div").click();
  }
}));

You might consider selecting using an XPath or iterating childNodes to examine all text nodes rather than assuming the text is at position 0, but I've left these as exercises to focus on the main point at hand.
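
For the childNodes variant, here is a minimal sketch of one way to collect every direct text node instead of relying on position 0. `ownText` and `fakeName` are hypothetical names introduced for illustration; they are not part of Puppeteer or of the page being scraped. The stand-in object only mimics the two DOM properties the helper reads, so the snippet runs in plain Node without a browser.

```javascript
// Hypothetical helper: returns the text held directly in an element's own
// text nodes, skipping child elements such as the <span>.
function ownText(el) {
  const TEXT_NODE = 3; // numeric value of Node.TEXT_NODE in the DOM
  return Array.from(el.childNodes)
    .filter(node => node.nodeType === TEXT_NODE)
    .map(node => node.textContent)
    .join(" ")
    .replace(/\s+/g, " ") // collapse newlines and indentation
    .trim();
}

// Stand-in for the .grid-option-name element (assumed shape, for the demo only).
const fakeName = {
  childNodes: [
    { nodeType: 3, textContent: "\n  Foo Bar\n  " },
    { nodeType: 1, textContent: "some other text" },
    { nodeType: 3, textContent: "\n" },
  ],
};

console.log(ownText(fakeName)); // => Foo Bar
```

Because the helper references nothing outside its own body, it should be possible to pass it straight to Puppeteer, for example `await el.$eval(".grid-option-name", ownText)`, which would run it in the browser against the real element.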

ggorlen