
I have never used this library before, so I apologize if this sounds like a stupid question.

What is going on is that I want to extract some specific text from this website:

https://www.unieuro.it/online/

I already have browser.js and all the requirements to make Puppeteer work,

so I just navigate to that website and make it search for something.

Keep in mind this is just a code sample:

await page.goto("https://www.unieuro.it");
await page.waitForSelector('input[name=algolia-search]');
await page.type("input[name=algolia-search]","echo dot")
await page.click(".icon-search")

Nothing weird so far; it works as intended. However, after this step things get weird really quickly.

First of all, I was completely unable to make it wait for any selector. I tried making it wait for the class collapsed hits__hit, for the article element, or even the section; it would time out every single time, so I just gave up and used

await page.waitForTimeout(3000);

From here I'm trying to extract the elements that have the class product-tile.

Specifically, I need the title, which is the textContent of an a element.

That a element is inside a div with the class product-tile__title, so what I tried is a simple eval:

var name = await page.$$eval(".product-tile__title", el => {
    el.map(el => el.querySelector("a").textContent)
    return el
})

This did not work at all; it gives me a bunch of empty objects inside an array.

So I tried installing an extension called Puppeteer Recorder, and I tried using the code it generated, which is

 const element1 = await page.$('.collapsed:nth-child(1) > .product-tile > .item-container > .info > .title')

In this case element1 does contain something, but it is in no way related to the title.

And right now I'm stuck. I'm completely unable to get the object I need in any way, and the results on the internet are not helping.

As a side note:

I wish there were a simpler way to make a scraper in Node. Why must all the libraries be so complicated and never work the way you want them to?

M1S0

1 Answer

This page has super long load times for me, probably because I'm on the other side of the planet from Italy, and some strange behavior. When I ran the search by typing with Puppeteer, the page returns 17000 results that seem totally unfiltered. I didn't bother to figure out why because I was able to go directly to the search results page using https://www.unieuro.it/online/?q=echo%20dot:

const puppeteer = require("puppeteer");

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  await page.setDefaultTimeout(10 ** 5);
  await page.setRequestInterception(true);
  page.on("request", req => {
    req.resourceType() === "image" ? req.abort() : req.continue();
  });
  await page.goto("https://www.unieuro.it/online/?q=echo%20dot");
  await page.waitForSelector(".product-tile__title");
  const titles = await page.$$eval(
    ".product-tile__title ",
    els => els.map(e => e.textContent)
  );
  console.log(titles.map(e => e.trim()));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;

Output:

[
  'Amazon Echo Dot',
  'Amazon Echo Dot (4th gen)',
  'Amazon Echo Dot (4th Gen)',
  'Amazon Echo Dot (4th gen)',
  'Amazon Echo Dot (4th gen)',
  'Amazon Echo Dot (4th gen)'
]
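
As an aside, the empty objects in your original attempt are a serialization problem, not a selector problem: `$$eval` runs its callback in the browser and serializes the return value back to Node, and DOM elements serialize to `{}`. The callback has to return plain strings instead. A rough sketch (the callback is factored into a standalone function here just for illustration; the selectors are copied from your question):

```javascript
// This runs inside the page via $$eval, so it must return
// serializable values (strings), not DOM elements.
const mapTitles = els =>
  els.map(el => el.querySelector("a").textContent.trim());

// Usage with Puppeteer:
// const names = await page.$$eval(".product-tile__title", mapTitles);
```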

I haven't tested this extensively (for example, on pages with tons of results), and there should be plenty of opportunity to improve load times by profiling and blocking irrelevant requests.

A probably better approach is to just hit the API directly and skip Puppeteer entirely:

const url = "https://mnbcenyfii-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia for JavaScript (3.35.1); Browser; JS Helper (2.28.0)&x-algolia-application-id=MNBCENYFII&x-algolia-api-key=977ed8d06b718d4929ca789c78c4107a";
const body = `{"requests":[{"indexName":"sgmproducts_prod","params":"query=echo%20dot&hitsPerPage=20&maxValuesPerFacet=20&page=0"}]}`;
fetch(url, {method: "post", body})
  .then(response => {
    if (!response.ok) {
      throw Error(response.status);
    }
    
    return response.json();
  })
  .then(data => data.results[0].hits.map(e => e.title_it))
  .then(results => console.log(results))
  .catch(err => console.error(err))
;

You can port this to Node easily with node-fetch or axios.
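
For reference, a rough sketch of what the Node port might look like. This assumes Node 18+, where fetch is available globally; on older versions node-fetch or axios fills the same role. The URL, API key, index name, and the title_it response field are all copied from the snippet above; the response handling is split into a plain function only so the parsing step is easy to test without hitting the network:

```javascript
// Same Algolia query, runnable in Node 18+ (global fetch).
const url = "https://mnbcenyfii-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia for JavaScript (3.35.1); Browser; JS Helper (2.28.0)&x-algolia-application-id=MNBCENYFII&x-algolia-api-key=977ed8d06b718d4929ca789c78c4107a";

// Pull the Italian product titles out of the Algolia response shape.
const extractTitles = data => data.results[0].hits.map(hit => hit.title_it);

async function searchTitles(query) {
  const body = JSON.stringify({
    requests: [{
      indexName: "sgmproducts_prod",
      params: `query=${encodeURIComponent(query)}&hitsPerPage=20&page=0`,
    }],
  });
  const response = await fetch(url, {method: "post", body});
  if (!response.ok) {
    throw Error(response.status);
  }
  return extractTitles(await response.json());
}

// searchTitles("echo dot").then(console.log).catch(console.error);
```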

ggorlen
    Wow, thank you, I didn't even know they had an API. And no, it's not you, the page is just terrible; that's why I kind of assumed they had closed their APIs. Anyhow, I already managed to find a solution on my own in the end. I will still mark this as the solution because it's the same thing I did. – M1S0 Aug 24 '21 at 14:24