In an attempt to make web scraping with a headless browser more resilient to site changes, I'd like to combine technical properties of the elements with their visual characteristics.

E.g. when looking for a search bar, I'd like to look for a "big (>50% width), visible (:visible) text input field (<input type="text">) in the upper half of the screen/rendered page." Then, when looking for the submit button, I'd like to find a button located near the aforementioned search bar.

Is there any way to set up this kind of search criterion? AFAICS, CSS selectors and XPath can only search by predefined parameters (tag, id, class, attributes), not by calculated ones.

The best idea I currently have is to search by predefined parameters, then filter the result further by getting size, position and such for each result and comparing them to the desired ranges. This is rather slow oftentimes since I have to use expressions like *[text()="visible text"] to not rely on technical details that are subject to change without notice.

ivan_pozdeev
  • You can search for whatever you want. What you want to do is possible and if you look around you can make quite advanced CSS selections in javascript. – Simon Hyll Jan 27 '18 at 14:36
  • @SimonHyll I don't see anything like calculated properties at https://www.w3schools.com/cssref/css_selectors.asp . Any pointers/keywords? [You know, you can only find it quickly if you already know what to look for.](https://meta.serverfault.com/questions/8934/what-to-do-with-questions-when-the-answer-is-in-a-man-page) – ivan_pozdeev Jan 28 '18 at 00:16

1 Answer

Here are a few ways to find the element you want. All the examples below assume you have an element that looks roughly like this (the tag and the CSS can differ, but the point is that somewhere there is an element with some styling and some attribute).

<div mycustomattribute="login" style="width:calc(5cm - 3cm)"></div>

Note that the examples below aren't necessarily all the ways there are, just the ones I could think of on the fly. If they don't solve your problem, I can probably think of one or two more.

Selecting using a custom attribute

You can set any attribute you want on any element you want. For example, you can write <div mycustomattribute="hello"> and then select it with querySelector; that's totally valid.

var test = document.querySelector("div[mycustomattribute=login]")

The above script selects the first div whose mycustomattribute attribute has the value login. I think you already know this method, but I figured I'd mention it because it's by far the easiest, least hacky way of finding a specific element, provided you can set an attribute on your element.

Select using position

Let's say you want to select whatever element sits 50 px to the right of the element you selected.

var base = document.querySelector("div[mycustomattribute=login]");
// Read the bounding box once; coordinates are relative to the viewport
var rect = base.getBoundingClientRect();
// Y coordinate of the base element's top edge
var y = rect.top;
// X coordinate of the base element's right edge, since we're going to look to the right of it
var x = rect.right;
// Topmost element at the point 50 pixels to the right of the base element
// (null if that point is outside the viewport)
var element = document.elementFromPoint(x + 50, y);
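
If you only know a general area rather than an exact point, one option is to compare bounding boxes instead. The following is a minimal sketch, not a definitive implementation: rectGap and nearestElement are hypothetical helper names, and the maxGap threshold is an assumption you would tune per site. It only reads layout information, so it doesn't change the page.

```javascript
// Gap in pixels between two boxes (0 if they touch or overlap).
// Boxes are {left, top, right, bottom} objects, as returned by
// getBoundingClientRect().
function rectGap(a, b) {
  const dx = Math.max(a.left - b.right, b.left - a.right, 0);
  const dy = Math.max(a.top - b.bottom, b.top - a.bottom, 0);
  return Math.hypot(dx, dy);
}

// Hypothetical helper: among `candidates`, return the element whose box is
// closest to `base`, or null if none lies within `maxGap` pixels.
function nearestElement(base, candidates, maxGap) {
  const baseRect = base.getBoundingClientRect();
  let best = null;
  let bestGap = Infinity;
  for (const el of candidates) {
    if (el === base) continue; // never return the base element itself
    const gap = rectGap(baseRect, el.getBoundingClientRect());
    if (gap <= maxGap && gap < bestGap) {
      best = el;
      bestGap = gap;
    }
  }
  return best;
}
```

In a page this could be called as nearestElement(base, document.querySelectorAll("button, input[type=submit]"), 100), where the selector and the 100 px limit are only illustrative.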

Select using CSS values

This is more tricky but certainly possible. You are correct in that you can't just run querySelector to find an element based on a CSS value (calculated or otherwise), but you can run the calculation yourself to get the value your desired element should have and then just loop through them to get the one you want.

So, for example:

var divs = document.querySelectorAll('div');
var element = null;

for (var i = 0; i < divs.length; ++i) {
    var div = divs[i];
    /* We assume you know the result of the calculated value, either because it's
a static result (e.g. `5cm - 3cm`), or because you rerun the calculation in
javascript to find out what its result is.
Note that you can use whatever style you want here to find the div, like
"visibility" or "display", just set up the proper if statements.
`style` only reflects inline declarations, as serialized by the browser; use
getComputedStyle(div).width for the resolved value in pixels.
    */
    if (div.style.width === "2cm") {
        element = div;
        break;
    }
}
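
The same loop can filter on rendered geometry instead of a single style value, e.g. for the "big, visible text input in the upper half of the screen" case from the question. This is a rough sketch under stated assumptions: the function names, the input selector and the 50% default are all illustrative, and the visibility check is only an approximation. It uses read-only calls (getComputedStyle, getBoundingClientRect), so it should not mutate the page.

```javascript
// Pure predicate: is the box non-empty, wider than `minWidthRatio` of the
// viewport, and entirely inside the upper half of the viewport?
function isWideAndInUpperHalf(rect, viewport, minWidthRatio = 0.5) {
  const width = rect.right - rect.left;
  const height = rect.bottom - rect.top;
  return width > 0 && height > 0 &&
    width > viewport.width * minWidthRatio &&
    rect.top >= 0 &&
    rect.bottom <= viewport.height / 2;
}

// Rough visibility check on a computed style. Approximation only: it ignores
// opacity, clipping and elements covered by others. An element inside a
// display:none ancestor is caught by the empty-box test above instead.
function looksVisible(style) {
  return style.display !== 'none' && style.visibility === 'visible';
}

// Hypothetical wrapper: cheap selector first, then the visual filters.
// `doc` and `win` are the page's document and window objects.
function findSearchBars(doc, win) {
  const viewport = { width: win.innerWidth, height: win.innerHeight };
  const selector = 'input[type="text"], input[type="search"], input:not([type])';
  return Array.from(doc.querySelectorAll(selector))
    .filter((el) => looksVisible(win.getComputedStyle(el)))
    .filter((el) => isWideAndInUpperHalf(el.getBoundingClientRect(), viewport));
}
```

Since getBoundingClientRect is relative to the viewport, "upper half" here means the upper half of what is currently scrolled into view. Running the selector first keeps the number of layout reads small, which matters if the filtering step is the slow part.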

References

As a side note, try using MDN (Mozilla's documentation) instead of w3schools; it is a much better reference. I was hesitant to make the jump at first too, but it really is better once you learn how to use it.

https://developer.mozilla.org/en-US/docs/Web/API/Document
https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll
https://developer.mozilla.org/en-US/docs/Web/API/Document/elementFromPoint
https://css-tricks.com/snippets/javascript/loop-queryselectorall-matches/

Simon Hyll
  • Custom attribute seems irrelevant to my use case: to apply it, I need to have already found the element. – ivan_pozdeev Jan 28 '18 at 02:11
  • Other than that, this is the same as my current idea -- to apply filtering on visual size/position by hand. While you claimed earlier that it's possible right away with "advanced CSS selections". `elementFromPoint` is not for my use case 'cuz I'll never have an exact position, only a general area, but it's useful as a pointer. Links are useful, too. – ivan_pozdeev Jan 28 '18 at 02:17
  • Since this is a headless browser, not simple HTML parsing, I can invoke JS. I just don't want to affect the page in any way that a user wouldn't so as not to break things or invoke some page-mutating logic unexpectedly. Retrieving info without making changes should be okay. – ivan_pozdeev Jan 28 '18 at 03:24