I am trying to scrape Amazon book pages using Cheerio and request in Node.js.

But I can't figure out how to get the print length and publication date from the HTML below:

<table id="productDetailsTable" cellspacing="0" cellpadding="0" border="0">
  <tbody>
    <tr>
      <td class="bucket">
        <h2>Product Details</h2>
        <div class="content">
          <ul>
            <li>
              <b>File Size:</b>
              2544 KB
            </li>
            <li>
              <b>Print Length:</b>
              658 pages
            </li>
            <li>
              <b>Publisher:</b>
              Anchor; 1st edition (September 15, 2009)
            </li>
          </ul>
        </div>
      </td>
    </tr>
  </tbody>
</table>

Any kind of help will be appreciated. Thanks.

kiranvj
Shafayat Alam

1 Answer

You can do this by adapting the approaches in "cheerio: Get normal + text nodes" and "How to get a text that's separated by different HTML tags in Cheerio". The .contents() method returns all child nodes of an element, including bare text nodes, whereas .children() returns only element nodes:

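const cheerio = require("cheerio"); // npm install cheerio

// `html` is assumed to hold the page markup shown in the question
const $ = cheerio.load(html);

// For each <li>, collect the trimmed text of every child node (the <b> label
// and the bare text node after it), dropping whitespace-only nodes
const result = [...$("#productDetailsTable .bucket .content li")].map(li =>
  [...$(li).contents()]
    .map(node => $(node).text().trim())
    .filter(Boolean)
);
console.log(result);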

Which gives:

[
  [ 'File Size:', '2544 KB' ],
  [ 'Print Length:', '658 pages' ],
  [ 'Publisher:', 'Anchor; 1st edition (September 15, 2009)' ]
]

Consider also converting these pairs into an object, stripping the trailing colon from each label:

const obj = Object.fromEntries(result.map(([a, b]) => [a.slice(0, -1), b]));

which produces:

{
  'File Size': '2544 KB',
  'Print Length': '658 pages',
  Publisher: 'Anchor; 1st edition (September 15, 2009)'
}

If you need the publication date specifically, try:

console.log(obj.Publisher.match(/(?<=\().+(?=\))/g)[0]);

which prints September 15, 2009.
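
Since the question asks for both the print length and the publication date, the lookups above can be wrapped in one helper. This is only a minimal sketch: extractBookInfo is a hypothetical name, not part of Cheerio or the code above, and it assumes an object shaped like obj with the colons already stripped from the keys. It returns null instead of throwing when an entry is missing or the Publisher text has no parenthesized date, which the bare .match(...)[0] call would not tolerate.

```javascript
// Hypothetical helper (an illustration, not part of the original answer):
// pulls the two requested fields out of an object shaped like `obj`.
function extractBookInfo(details) {
  // "658 pages" -> 658; also tolerates thousands separators like "1,024 pages"
  const pages = /(\d[\d,]*)\s+pages/i.exec(details["Print Length"] ?? "");
  // Text inside the trailing parenthesized group, e.g. "(September 15, 2009)"
  const date = /\(([^()]+)\)\s*$/.exec(details.Publisher ?? "");
  return {
    printLength: pages ? Number(pages[1].replace(/,/g, "")) : null,
    publicationDate: date ? date[1] : null,
  };
}

const info = extractBookInfo({
  "File Size": "2544 KB",
  "Print Length": "658 pages",
  Publisher: "Anchor; 1st edition (September 15, 2009)",
});
console.log(info);
// { printLength: 658, publicationDate: 'September 15, 2009' }
```

The date is kept as a string on purpose: parsing free-form text such as "September 15, 2009" with new Date() depends on the JavaScript engine, so any conversion is best done with an explicit format.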

ggorlen