
I am working on scraping a bunch of pages with Puppeteer. The content is not differentiated with classes/ids/etc. and is presented in a different order between pages. As such, I will need to select the elements based on their inner text. I have included a simplified sample html below:

<table>
<tr>
    <th>Product name</th>
    <td>Shakeweight</td>
</tr>
<tr>
    <th>Product category</th>
    <td>Exercise equipment</td>
</tr>
<tr>
    <th>Manufacturer name</th>
    <td>The Shakeweight Company</td>
</tr>
<tr>
    <th>Manufacturer address</th>
    <td>
        <table>
            <tr><td>123 Fake Street</td></tr>
            <tr><td>Springfield, MO</td></tr>
        </table>
    </td>
</tr>
</table>

In this example, I would need to scrape the manufacturer name and manufacturer address. So I suppose I would need to select the appropriate tr based upon the inner text of the nested th and scrape the associated td within that same tr. Note that the order of the rows of this table is not always the same and the table contains many more rows than this simplified example, so I can't just select the 3rd and 4th td.

I have tried to select an element based on its inner text using XPath as below, but it does not seem to be working:

var manufacturerName = document.evaluate("//th[text()='Manufacturer name']", document, null, XPathResult.ANY_TYPE, null)

This wouldn't even be the data I would need (it would be the td associated with this th), but I figured this would be step 1 at least. If someone could provide input on the strategy to select by inner text, or to select the td associated with this th, I'd really appreciate it.

MacGruber
  • Related: [How to click on element with text in Puppeteer](https://stackoverflow.com/questions/47407791/how-to-click-on-element-with-text-in-puppeteer) – ggorlen Mar 05 '23 at 06:15

4 Answers


This is really an XPath question rather than a Puppeteer-specific one, so this question might also help, since you're going to need to find the <td> that comes after the <th> you've found: XPath:: Get following Sibling

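But your XPath does work for me. Note that document.evaluate with XPathResult.ANY_TYPE returns an XPathResult iterator rather than the element itself, so you would have to call iterateNext() on the result to get the <th>. For a quicker check, in Chrome DevTools on the page with the HTML in your question, run this line to query the document: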

$x('//th[text()="Manufacturer name"]')

NOTE: $x() is a helper function that only works in Chrome DevTools, though Puppeteer has a similar Page.$x function.

That expression should return an array with one element, the <th> with that text in the query. To get the <td> next to it:

$x('//th[text()="Manufacturer name"]/following-sibling::td')

And to get its inner text:

$x('//th[text()="Manufacturer name"]/following-sibling::td')[0].innerText

Once you're able to follow that pattern, you should be able to use the same strategy to get the data you want in Puppeteer, like this:

const puppeteer = require('puppeteer');

const main = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://127.0.0.1:8080/');  // <-- EDIT THIS

  const mfg = await page.$x('//th[text()="Manufacturer name"]/following-sibling::td');
  const prop = await mfg[0].getProperty('innerText');
  const text = await prop.jsonValue();
  console.log(text);

  await browser.close();
}

main();
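
If you need more than one row (name and address, say), it may help to build the XPath from the label instead of hard-coding it. Here is a minimal sketch of that idea. The helper names xpathLiteral and valueCellXPath are my own, not part of Puppeteer, and the use of normalize-space() assumes you want to ignore stray whitespace around the header text:

```javascript
// Quote an arbitrary string as an XPath 1.0 string literal.
// XPath 1.0 has no escape character, so a value that contains both
// quote types has to be assembled with concat().
function xpathLiteral(value) {
  if (!value.includes('"')) return `"${value}"`;
  if (!value.includes("'")) return `'${value}'`;
  const parts = value.split('"').map((part) => `"${part}"`);
  return `concat(${parts.join(`, '"', `)})`;
}

// Hypothetical helper: XPath for the first <td> that follows the <th>
// whose whitespace-normalized text equals `label`.
function valueCellXPath(label) {
  return `//th[normalize-space(.)=${xpathLiteral(label)}]/following-sibling::td[1]`;
}

console.log(valueCellXPath('Manufacturer name'));
console.log(valueCellXPath('Manufacturer address'));
```

The resulting string can be passed to page.$x() as in the script above. As far as I know, recent Puppeteer releases dropped page.$x in favor of passing an 'xpath/' prefixed selector to page.$$(), so check the documentation for the version you have installed.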
Todd Price

Based on the use case you clarified in the comment on my other answer, here is the logic for it:

await page.goto(url, { waitUntil: 'networkidle2' }); // Go to webpage url

await page.waitForSelector('table'); // wait for the table that contains the text

const textDataArr = await page.evaluate(() => {
    const trArr = Array.from(document.querySelectorAll('table tbody tr'));

    // Find the row whose th innerText equals 'Manufacturer name'
    const row = trArr.find((tr) => {
        const th = tr.querySelector('th');
        return th !== null && th.innerText.trim() === 'Manufacturer name';
    });

    // If a matching row is found, return the innerText of its td; otherwise return undefined
    return row ? row.querySelector('td').innerText : undefined;
});
console.log(textDataArr);
kavigun

You can do something like this to get the data:

await page.goto(url, { waitUntil: 'networkidle2' }); // Go to webpage url

await page.waitForSelector('table'); // wait for the table that contains the text

const textDataArr = await page.evaluate(() => {
    const element = document.querySelector('table tbody tr:nth-child(3) td'); // select the third row's td element
    return element && element.innerText; // returns the text, or null if the element is not found
});
console.log(textDataArr);
kavigun
  • Thanks for the response - unfortunately the order of the rows of this table is not always the same, so I can't just select the 3rd and 4th td. There are also no ids or classes - I need to select the td based on the inner text of the th of the same tr being "Manufacturer name" or "Manufacturer address" – MacGruber Sep 24 '20 at 18:01
  • I posted a new answer for the use case you are clarifying here, try that logic it will work for you. – kavigun Sep 24 '20 at 18:45

A simple way to get all of the rows at once, as an object keyed by each row's header text:

let data = await page.evaluate(() => {
  return [...document.querySelectorAll('tr')].reduce((acc, tr) => {
    let cells = [...tr.querySelectorAll('th,td')].map(el => el.innerText)
    acc[cells[0]] = cells[1]
    return acc
  }, {})
})
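
One thing to watch for with the HTML in the question: the nested address table has rows with a single cell, so those end up as keys with an undefined value. A rough sketch of one way to filter them out follows. The helper name pairsToObject is hypothetical, and the sample pairs are made up to resemble what the rows might yield, so treat the exact strings as an assumption:

```javascript
// Hypothetical helper: turn [header, value] pairs into a lookup object.
function pairsToObject(pairs) {
  const result = {};
  for (const [header, value] of pairs) {
    // Skip rows without both cells, e.g. the rows of a nested table.
    if (typeof header !== 'string' || typeof value !== 'string') continue;
    result[header.trim()] = value.trim();
  }
  return result;
}

// Made-up sample pairs, shaped like the cell texts of each <tr>.
const sample = [
  ['Manufacturer name', ' The Shakeweight Company '],
  ['Manufacturer address', '123 Fake Street\nSpringfield, MO'],
  ['123 Fake Street'], // nested-table row: only one cell
  ['Springfield, MO'], // nested-table row: only one cell
];

const data = pairsToObject(sample);
console.log(data['Manufacturer name']);
console.log(data['Manufacturer address']);
```

The same filtering could be done inside the page.evaluate callback itself; keeping it as a plain function just makes it easy to try out without a browser.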
pguardiario