
I've encountered a problem involving pseudo-elements of "a" (hyperlink) elements. I'm parsing a set of web pages with unknown structures, looking for images that are links. In other words, if the user clicks on that image, they will be taken to another web page.

In the typical case where the img element is a descendant of an "a" element, this is trivial: ascend the DOM tree from the img node and look for an "a" element.
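That ancestor walk can be sketched in a few lines; `linkForImage` is an illustrative name, not anything standard:

```javascript
// Trivial case: the image is a descendant of an <a> element.
// Element.closest() walks up from the node and returns the first
// ancestor (or the node itself) matching the selector, or null.
function linkForImage(img) {
  const a = img.closest('a[href]');
  return a ? a.href : null;
}
```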

But I recently encountered a construct like this:

<div>
    <a class="X" href="Y"></a> 
    <div>
        <img ....>
    </div>
</div>

That's an empty "a" element (with zero height) behind the img. But when I click on the img, I go to the URL Y.

When I look at the CSS, I see something like:

a.X:after {
    bottom: 0;
    content: "";
    left: 0;
    position: absolute;
    right: 0;
    top: 0;
    z-index: 1;
}

*::after {
    box-sizing: inherit;
}

It would seem that there is a pseudo-element after the "a" element (but before the img in the DOM tree) which is the same size as the containing div and which acts, for the purposes of being a link, as though it is part of the "a" element. It covers the img and sits above it (z-index: 1).

I have two questions:

  1. Have I more or less understood correctly what is going on?

  2. Is it possible to detect this situation - to find the URL that is de facto linked to the img element - by walking the DOM tree?

Timmy K

1 Answer


Okay, so you have understood it more or less correctly.

The pseudo-element is positioned absolutely, so it has no regard for its place in the DOM tree; in this case it is made to overlap the img and sit above it.

You could try writing some code to check for overlap between the ::after pseudo-element and the image, as was done here: How to detect when an element over another element in JavaScript?
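A minimal sketch of that overlap check, assuming the pattern from the question (an absolutely positioned ::after that stretches over the anchor's positioned ancestor); `findOverlayLink` and `rectsOverlap` are illustrative names:

```javascript
// Pure geometry helper: do two DOMRect-like boxes intersect?
function rectsOverlap(a, b) {
  return a.left < b.right && b.left < a.right &&
         a.top < b.bottom && b.top < a.bottom;
}

// Given an <img>, look for an <a href> whose ::after pseudo-element is
// absolutely positioned and whose containing box overlaps the image.
function findOverlayLink(img) {
  const imgRect = img.getBoundingClientRect();
  for (const a of document.querySelectorAll('a[href]')) {
    // getComputedStyle takes a pseudo-element as its second argument.
    const after = getComputedStyle(a, '::after');
    if (after.position !== 'absolute' || after.content === 'none') continue;
    // With top/right/bottom/left all 0, the pseudo-element fills the
    // anchor's nearest positioned ancestor; offsetParent approximates it.
    const box = (a.offsetParent || a).getBoundingClientRect();
    if (rectsOverlap(box, imgRect)) return a.href;
  }
  return null;
}
```

Note this is a heuristic: it ignores z-index, pointer-events, and other stacking subtleties, so it may report an anchor that is not actually the one receiving the click.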

My recommendation, however, assuming the structure of the "posts" or whatever else you're scraping repeats itself, would be to just hard-code the location of the link in the DOM tree and scrape it from there.

If the articles or posts prove to have multiple links, you could look for something repeating about the one containing the link you're interested in: a certain link structure, a CSS property, basically anything.

Blye
  • Good answer, just want to add that the structure described in OP is unconventional and un-semantic; it would also not work without CSS and/or JS. So indeed, if it's a specific site that is important to be rechecked over time, hard-code some extra location logic, but otherwise just don't bother with that site until they fix their thumbnail structure. – webketje Oct 23 '22 at 10:46
  • Thanks. I understand the suggestion to use hard-coded features, but I specifically don't want to do that: I'm writing a browser extension and would like it to work on as many sites as possible without knowing anything about them a priori. The suggestion with regard to overlap sounds good and might yield the generality I'm after. – Timmy K Oct 23 '22 at 16:20
  • If it's an extension, I would keep in mind this can be very resource-intensive for the user – Blye Oct 23 '22 at 18:50