
Challenge

When using puppeteer page.click('something') one of the first challenges is to make sure that the right 'something' is provided.

I guess this is a very common challenge, yet I did not find any simple way to achieve this.

What I tried so far

In Google Chrome I inspect the element that I want to click. I then get an extensive element description with a class and such. Based on an example I found, my approach is now:

  1. Take the class
  2. Replace all spaces with dots
  3. Try
  4. If it fails, check what is around this and add it as a prefix, for example one or two instances of button.

This does not exactly feel like it is the best way (and sometimes also fails, perhaps due to inaccuracies from my side).

One thing I notice is that Chrome often seems to show a hint when I hover over the thing I want to click. I am not sure whether that hint is the right input, and I did not see a way to copy it (it can also be quite long).

[Image from tutorialspoint]

If there is a totally different recommended way (e.g. looking in the browser for roughly what the name is, and then using Puppeteer to list all possible candidates), that is also fine. I just want to get the right input for page.click().

If you need an example of what I am trying: If you open this question in an incognito tab, you get options like share or follow. Or if you go to a web shop like staples and want to add something to cart.

Dennis Jaheruddin
  • What you seem to be looking for is this: you can right-click on the highlighted DOM element in the DevTools 'Elements' tab (in your screenshot it is the `` code block highlighted in blue), then select _'Copy' > 'Copy selector'_. It might be a longer selector expression, but you can trim the unwanted parts. – theDavidBarton Oct 01 '22 at 09:39
  • While what David recommends is possible, [I don't recommend browser-generated selectors](https://serpapi.com/blog/puppeteer-antipatterns/#misusing-developer-tools-generated-selectors). They tend to be overly rigid and there's almost always a cleaner selector to get to an element. There's no substitute for learning CSS selectors and building up intuition about the most robust way to query something. This question is pretty broad--it sounds like you have a real use case here, so I suggest asking about that and providing a [mcve] and full specification. I can explain my selector reasoning then. – ggorlen Oct 01 '22 at 14:27
  • For example, those elements have ids, which are nearly always the best CSS selectors since they're unique on the page (or supposed to be, and usually are, rare non-compliant documents aside). I can't read the id in the low res image (prefer [text](https://meta.stackoverflow.com/questions/285551/why-should-i-not-upload-images-of-code-data-errors-when-asking-a-question)), but it's something like `page.click("#gsc-l-id1")`. If you try to copy by means of right click, you'll probably get some massive, incomprehensible selector that will break if even the slightest change occurs on the page. – ggorlen Oct 01 '22 at 14:33
  • @ggorlen As a tangible example: at the bottom of this (and each) question is a 'Question feed' button. When you click it, a popup with a close button appears. I was actually able to find a way to open and close that popup programmatically after reading the comments, but I am still curious how you would approach the choice of selector for this. The actual steps/thought process would make a nice answer. – Dennis Jaheruddin Oct 01 '22 at 20:22
  • Yes, I see the RSS feed popup with a close button here. Are you asking how I'd approach selecting that button? The answer is usually "it depends"--how robust should the script be, what sort of parent tree are we dealing with, do we expect the site structure to change, etc. But the browser-generated selector (Chrome 105.0.5195.127) is basically useless: `body > aside > div > a`. In Puppeteer, you can press the "escape" key, which avoids selecting things entirely. – ggorlen Oct 01 '22 at 20:36

1 Answer


> When using puppeteer page.click('something') one of the first challenges is to make sure that the right 'something' is provided.

Just to be clear, "something" is a CSS selector, so your question seems to reduce to how to write CSS selectors that are accurate. Or, since Puppeteer offers XPath and traditional DOM traversals, we could extend it to include those selection tools as well.
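To make those options concrete, here is a minimal sketch of two of them: a plain CSS selector passed to `page.click()`, and a DOM traversal run inside the page with `page.evaluate()`. The helper names `clickByCss` and `clickByTraversal` are my own, and any selector or label you pass in is a placeholder for whatever the target page actually uses.

```javascript
// `page` is a Puppeteer Page object. Both helpers are illustrative sketches;
// the names are hypothetical and not part of Puppeteer itself.

// 1. CSS selector: wait until the element exists, then click it.
async function clickByCss(page, selector) {
  await page.waitForSelector(selector);
  await page.click(selector);
}

// 2. DOM traversal: run ordinary browser code inside the page and click the
//    first element matching `selector` whose text contains `label`.
async function clickByTraversal(page, selector, label) {
  return page.evaluate(
    (sel, text) => {
      const el = [...document.querySelectorAll(sel)].find((node) =>
        (node.textContent || "").includes(text)
      );
      if (el) el.click();
      return Boolean(el); // true if something was clicked
    },
    selector,
    label
  );
}
```

Usage would look something like `await clickByCss(page, "#save")` or `await clickByTraversal(page, "button", "Save")`, where `#save` and `Save` are made-up values. The second form is handy when no single selector identifies the element but its text does.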

Broader still, if there's a data goal we're interested in, there are often other routes to get the data that don't involve touching the document at all.
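As one possible illustration of such a route, the sketch below waits for a network response and reads its JSON body instead of querying the DOM. It assumes the site loads its data from an endpoint whose URL contains a known fragment; the helper name `captureJson` and the fragment you pass are hypothetical, and you would find the real endpoint in the DevTools Network tab.

```javascript
// `page` is a Puppeteer Page object. `captureJson` is a hypothetical helper.
// `urlFragment` identifies the response of interest, and `trigger` is a
// function that causes the request (a navigation, a click, and so on).
async function captureJson(page, urlFragment, trigger) {
  // Start waiting before triggering so a fast response is not missed.
  const [response] = await Promise.all([
    page.waitForResponse(
      (res) => res.url().includes(urlFragment) && res.ok()
    ),
    trigger(),
  ]);
  return response.json();
}
```

A call might look like `await captureJson(page, "/api/products", () => page.goto(url))`, with `/api/products` standing in for whatever the site really requests. When this works, there is no selector left to get wrong.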

> I guess this is a very common challenge, yet I did not find any simple way to achieve this.

That's because there is no simple way to achieve this. It's like asking for the one baseball swing that hits all pitches. Web pages have messy, complex, arbitrary structures that follow thousands of different conventions (or no conventions at all). They can serve up a slightly or completely different page structure on any request. There's no silver-bullet strategy for writing good CSS selectors, and no step-by-step algorithm you can apply to universally "solve" the problem of accurately and robustly selecting elements.

Your goal should be to learn the toolkit and then practice on many different pages to develop an intuition for which tools and tricks work in which contexts and be able to correctly discern the tradeoffs in different approaches. Writing a full guide to this is out of scope, and articles exist elsewhere that cover this in depth, but here are a few high-level rules of thumb:

  • Look at context: consider the goals of your project, the general structure of the page and patterns on the page. Too many questions on Stack Overflow regarding CSS selectors (but also in general) omit context, which severely constrains the recommendation space, often leading to an XY problem. A few factors that are often relevant:
    • Whether the scrape is intended to be one-off or a long-running script that should try to anticipate and be resilient to page changes over time
    • Development time/cost/goal tradeoffs
    • Whether the data can be obtained by other means than the DOM, like accessing an API, pulling a JSON blob from a <script> tag, accessing a global variable on the window or intercepting a network response.
    • Considering nesting: is the element in a frame or shadow DOM?
    • Considering whole-page context: which patterns does the site tend to follow? Are there parent elements that are useful to selecting a child? (often, this is a distant relationship, not visible in a screenshot as provided by OP)
    • Consider all capabilities provided by your toolkit. For example, OP asked for a selector to close a modal on Stack Overflow; it turns out that none of the elements have particularly great CSS selectors, so using Puppeteer to trigger an Esc key press might be more robust.
  • Keep it simple: since pages can change at any time, the more constraints you add to the selector, the more likely one of those assumptions will no longer be true, introducing unnecessary points of failure.
  • Look for unique identifiers first: ids are usually unique on a page (some Google pages seem to scoff at this rule), so those are usually the best bets. For elements without an id, my next steps are typically:
    • Look for an id in a close parent element and use that, then select the child based on its next-most-unique identifier, usually a class name or a combination of tag name and attribute (like an input field with a name attribute, for example).
    • If there are few ids or none nearby, check whether a class name or attribute is unique. If so, consider using that, likely coupled with a parent container class.
  • When selecting between class names, pay attention to those that seem temporary or stateful and might be added and removed dynamically. For example, a class of .highlighted-tab might disappear when the element isn't highlighted.
  • Prefer "bespoke" class names that seem tied to role or logic over generic library class names associated with styling (Bootstrap, Semantic UI, Material UI, Tailwind, etc).
  • Avoid the > operator which can be too rigid, unless you need precision to disambiguate a tree where no other identifiers are available.
  • Avoid sibling selectors unless unavoidable. Siblings often have more tenuous relationships than parents and children.
  • Avoid :nth-child and :nth-of-type to the extent possible. Lists are often reordered or may have fewer or more elements than you expect.
  • When using anything related to text, generally trim whitespace, ignore case and special characters where appropriate and prefer substrings over exact equality. On the other hand, don't be too loose. Usually, text content and values are weak targets but sometimes necessary.
  • Avoid pointless steps in a selector, like body > div#container > p > .target which should just be #container .target or #container p .target. body says almost nothing, > is too rigid, div isn't necessary since we have an id (if it changes to a span our new selector will still work), and the p is generic--there are probably no .targets outside of ps anyway.
  • Avoid browser-generated selectors. These are usually the worst of both worlds: highly vague and rigid at the same time. The goal is to be the opposite: accurate and specific, yet as flexible as possible.
  • Feel free to break rules as appropriate.
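
Two of the rules above are easy to show in code: loose text matching (trim, ignore case, prefer substrings) and using a key press instead of a selector. This is only a rough sketch under my own assumptions; the helper names `normalizeText`, `textMatches`, `clickByText` and `dismissWithEscape` are made up for illustration, and how loose the matching should be depends on the page.

```javascript
// `page` is a Puppeteer Page object. All helper names here are hypothetical.

// Collapse runs of whitespace, trim, and lower-case for loose comparison.
function normalizeText(text) {
  return String(text ?? "").replace(/\s+/g, " ").trim().toLowerCase();
}

// Substring match on normalized text rather than exact equality.
function textMatches(actual, wanted) {
  return normalizeText(actual).includes(normalizeText(wanted));
}

// Click the first element matching `selector` whose text loosely contains
// `wanted`. Returns true if something was clicked, false otherwise.
async function clickByText(page, selector, wanted) {
  const handles = await page.$$(selector);
  for (const handle of handles) {
    const text = await handle.evaluate((el) => el.textContent);
    if (textMatches(text, wanted)) {
      await handle.click();
      return true;
    }
  }
  return false;
}

// Dismiss a modal without selecting anything at all.
async function dismissWithEscape(page) {
  await page.keyboard.press("Escape");
}
```

For instance, `await clickByText(page, "button", "add to cart")` would tolerate extra whitespace or different capitalization in the button label, and `await dismissWithEscape(page)` sidesteps the question of how to select a close button. Keeping `selector` narrow (a container id plus a tag, say) stops the text match from becoming too loose.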
ggorlen
  • Thank you, indeed most questions don't provide enough context (presumably because askers don't know what is relevant) and also most answers don't provide much reasoning. This could help people improve both! – Dennis Jaheruddin Oct 03 '22 at 06:16