Parsing awful HTML: How do I recognize boundaries with xpath?

Question

This is almost going to sound like a joke, but I promise you this is real life. There is a site on the internet, one which you have all used, that does not believe in CSS classes. Everything is defined directly in the style attribute on each element. It's horrifying.

My problem though is that it also makes the html extraordinarily difficult to parse. The structure that I've got to go on looks something like this:

<td>
    <a name="<random_string>"></a>
    <div style="generic-style, used by other elements">
        <div style="similarly generic style">{some_stuff}</div>
    </div>
    <a name="<random_string>"></a>
    ...
</td>

Basically, I've got these a tags that form the boundaries of the reviews, whose only defining information is the random string that is their name. I don't actually care about the anchor tags, but I would like to grab the reviews between them using xpath.

I've looked into sibling queries, but they don't seem to be well suited for alternating boundaries. I also looked into the Kayessian method of xpath queries, which (aside from having an awesome name) only seems well suited to grab a particular div, rather than all divs between the anchor tags.

Any thoughts on how I could grab the divs here?

@JoshBurgess Thank you for the condolences. Guess which site it is? — Slater Victoroff, Aug 06 '15 at 16:01
Really, Spanish amazon? Google's got the same issue. US Amazon's not structured like that though. Regardless, you have my sympathy. — Josh Burgess, Aug 06 '15 at 16:06
@JoshBurgess, yea, it seems to be the only alternate-language version of amazon with this problem. The rest are all pretty reasonable. — Slater Victoroff, Aug 06 '15 at 16:07
You could try tweeting them to kick them into touch. However, I wouldn't be surprised if Amazon respond with a Cease and Desist if they become aware that someone is scraping their site. I've heard anecdotally that a certain popular auction site is legally aggressive, and I imagine the biggest web properties are all like that. — halfer, Aug 06 '15 at 16:23
@Slater: heh, well some diplomacy might help! But in general, yes: the likes of Amazon want to be seen as technical thought leaders, and web techniques from 2001 don't help their cause. Do they maintain an engineering Twitter account? — halfer, Aug 06 '15 at 17:05

score 1 · Answer 1 · answered Aug 06 '15 at 16:18

I figured it out! It turns out that an xpath predicate can contain a relative path, including one that steps back up to the parent with "..". This is standard xpath behavior rather than a quirk, and it happens to work in this case! Here's the xpath:

//td/div[../a[@name]]

Nice and clean. The ../a[@name] predicate basically just says:

Go up a level, and make sure that on that level of the hierarchy there's an a element with a name attribute.
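
To see the selector in action, here is a minimal sketch in Python. It assumes the third-party lxml library (nobody in this thread names a parser), and the markup, anchor names and style values in SAMPLE are made up for illustration, not copied from the real site.

```python
from lxml import etree  # assumption: lxml is installed (pip install lxml)

# Hypothetical markup modelled on the structure in the question.
SAMPLE = """
<table><tr>
  <td>
    <a name="r1x9"></a>
    <div style="margin-left:0.5em;">
      <div style="margin-bottom:0.5em;">first review</div>
    </div>
    <a name="q7k2"></a>
    <div style="margin-left:0.5em;">
      <div style="margin-bottom:0.5em;">second review</div>
    </div>
    <a name="z3m8"></a>
  </td>
  <td>
    <div style="margin-left:0.5em;">sidebar, no anchors here</div>
  </td>
</tr></table>
"""

# etree.HTML uses libxml2's forgiving HTML parser and returns the root element.
root = etree.HTML(SAMPLE)

# Keep each div that is a direct child of a td whose children include
# at least one <a> with a name attribute.
divs = root.xpath("//td/div[../a[@name]]")

texts = ["".join(d.itertext()).strip() for d in divs]
print(texts)  # ['first review', 'second review']
```

The div in the second td is skipped because that td has no named anchor among its children. Note that the predicate only checks that such an anchor exists somewhere next to the div, not that the div actually sits between two anchors.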

Slater Victoroff

1) Does this really solve your issue? - Any `div` with a sibling `a`, irrespective of order or div nesting? 2) Then it's the same as `//td/a[@name]/../div`. – JimmyB Aug 06 '15 at 16:25
@HannoBinder, it is not an ideal solution, but it does technically solve the problem for me. I'm not going to accept it because I think there are probably better solutions. This is... a solution that just happens to work, and it does appear that selector is equivalent. – Slater Victoroff Aug 06 '15 at 16:28

score 1 · Accepted Answer · answered Aug 07 '15 at 03:19

If //td/div[../a[@name]] works for you, then the following should also work:

//td[a/@name]/div

This way you don't need to go down and then back up. For a more specific selector, you may want to try the following:

//td/div[preceding-sibling::*[1][self::a/@name]][following-sibling::*[1][self::a/@name]]

The XPath selects each div element that has all of the following properties:

td/div : it is a child of a <td> element
[preceding-sibling::*[1][self::a/@name]] : its nearest preceding sibling element is an <a> element with a name attribute
[following-sibling::*[1][self::a/@name]] : its nearest following sibling element is an <a> element with a name attribute
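
As a rough comparison of the two selectors, here is a small Python sketch. It assumes the third-party lxml library, and the SAMPLE markup (including the header and pagination divs) is invented for illustration; the real page may look different.

```python
from lxml import etree  # assumption: lxml is installed (pip install lxml)

# Hypothetical markup: two reviews fenced by anchors, plus two divs
# that sit outside the anchor boundaries.
SAMPLE = """
<table><tr><td>
  <div style="font-size:small;">header, before the first anchor</div>
  <a name="r1x9"></a>
  <div style="margin-left:0.5em;"><div>first review</div></div>
  <a name="q7k2"></a>
  <div style="margin-left:0.5em;"><div>second review</div></div>
  <a name="z3m8"></a>
  <div style="font-size:small;">pagination, after the last anchor</div>
</td></tr></table>
"""

LOOSE = "//td[a/@name]/div"
STRICT = (
    "//td/div[preceding-sibling::*[1][self::a/@name]]"
    "[following-sibling::*[1][self::a/@name]]"
)


def texts(root, expr):
    """Return the stripped text of every element matched by expr."""
    return ["".join(el.itertext()).strip() for el in root.xpath(expr)]


root = etree.HTML(SAMPLE)
loose = texts(root, LOOSE)
strict = texts(root, STRICT)

print(len(loose))  # 4: header, both reviews, and pagination
print(strict)      # ['first review', 'second review']
```

Because *[1] on a sibling axis counts elements only, the whitespace between tags does not get in the way. The strict form drops the header and pagination divs since each of them has an anchor on one side only.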
