I want to use a recursive loop for resilient discovery and location of DOM elements, one that works across the semi-structured, semi-uniform HTML documents of a single website.

For example, when crawling links on a website, I come across small variations in a link's XPath location. Resilience is desired so that crawling can continue uninterrupted.

1) I know that I want a link located in a certain region of the page that is distinguishable from the rest (e.g. menus, footer, header, etc.).

2) It's distinguishable because it appears inside a table within a paragraph or container.

3) There can be an acceptable number of unexpected parents or children before the desired link mentioned in 1), but I don't know in advance what they will be. More unexpected elements would mean a departure from 1).

4) Identifying the element via its id, class, or any other unique attribute value is not desired.

I think the following XPath sums it up:

`//p/table/tr/td/a`

On some pages there are variations of the XPath that still qualify as the desired link from 1):

`//p/div/table/tr/td/a` or `//p/div/span/span/table/tr/td/b/a`

I have used indentation to mimic each loop iteration.

(Should I use plural or singular? Children vs. child, parents vs. parent. I think singular makes sense, as only the immediate parent or child is of concern here.)

TOP DOWN SEARCHING:

how many p's are there?
 how many of these p's have table as a child? If none, search the next sub level.
   how many of these tables have tr as a child? If none, search the next sub level.
     how many of these tr's have td as a child? If none, search the next sub level.
      how many of these td's have a as a child?

BOTTOM UP SEARCHING:

how many a's are there?
 how many of these a's have td as a parent? If none, look up to the next super level.
  how many of these td's have tr as a parent? If none, look up to the next super level.
   how many of these tr's have table as a parent? If none, look up to the next super level.
    how many of these tables have p as a parent? If none, look up to the next super level.

Does it matter whether it's top down or bottom up? I feel that top down is wasteful if it turns out, by the end of the loop, that the desired anchor link is not found.

I think I would also count how many unexpected parents or children were encountered across the iterations of the loop and compare that count to a preset constant that I am comfortable with, say no more than 2. If there are 3 or more unexpected parents or children before my desired anchor link is reached, I would assume it's not what I am looking for.
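To make the idea concrete, here is a minimal sketch of the bottom-up variant with a tolerance budget. It is only an illustration under assumptions: the names `find_links`, `EXPECTED_UP` and `MAX_UNEXPECTED` are made up for this example, the sample markup is invented, and it uses Python's `xml.etree.ElementTree`, which needs well-formed markup (real-world HTML would likely need an HTML parser instead).

```python
import xml.etree.ElementTree as ET

# Expected ancestors of the <a>, nearest first (hypothetical names and values).
EXPECTED_UP = ["td", "tr", "table", "p"]
MAX_UNEXPECTED = 2  # tolerance: how many unexpected ancestors are acceptable

SAMPLE = """
<html><body>
  <p><table><tr><td><a href="/exact">exact</a></td></tr></table></p>
  <p><div><table><tr><td><a href="/one-extra">one extra</a></td></tr></table></div></p>
  <p><div><span><span><table><tr><td><b><a href="/four-extra">four extra</a></b></td></tr></table></span></span></div></p>
  <div><a href="/menu">menu</a></div>
</body></html>
""".strip()


def find_links(root, expected_up=EXPECTED_UP, max_unexpected=MAX_UNEXPECTED):
    """Return <a> elements whose ancestors contain expected_up in order,
    with at most max_unexpected other elements in between."""
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    matches = []
    for link in root.iter("a"):
        remaining = list(expected_up)
        unexpected = 0
        node = parent_of.get(link)
        # Walk upwards until the chain is complete, the budget is spent,
        # or the root has been passed.
        while node is not None and remaining and unexpected <= max_unexpected:
            if node.tag == remaining[0]:
                remaining.pop(0)      # expected ancestor found
            else:
                unexpected += 1       # unexpected element, spend budget
            node = parent_of.get(node)
        if not remaining and unexpected <= max_unexpected:
            matches.append(link)
    return matches


root = ET.fromstring(SAMPLE)
print([a.get("href") for a in find_links(root)])
```

Starting from the `a` elements means every walk begins at a real candidate, and a walk stops as soon as the budget is exceeded. With the budget raised to 4, the deeply nested `/four-extra` link would be accepted as well, while the `/menu` link is still rejected because no `td`, `tr`, `table`, `p` chain exists above it.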

Is this the correct approach? It is just something I came up with off the top of my head. I apologize if the problem is not clear; I have tried my best. I would love to get some input on this algorithm.

heysup
  • I really don't understand the question, but if it's about performance, I would say that streaming (like SAX) is the way to go: when `p` mark is found enter in searching for `a` state. –  Dec 30 '10 at 23:46
  • The question is basically about locating HTML elements in a way that is resilient to small changes in their location. For example, a user viewing an element in the browser is unaware of small changes in its XPath, but to a spider the slightest change in the XPath is interpreted as a whole new element. I don't understand what you mean by streaming or SAX. Could you clarify what you meant by `**p** mark is found enter in searching for **a** state`? Do you mean search for p, then search for an anchor inside it? – heysup Dec 30 '10 at 23:54
  • Good question, +1. See my answer for a single XPath expression that selects exactly the `a` elements that satisfy your requirements. :) – Dimitre Novatchev Dec 31 '10 at 02:56

1 Answer

Seems that you want something like:

//p//table//a

If you have limitations for the number of intermediate elements in the path, say not more than 2, then the above would be modified to:

//p[not(ancestor::*[3])]
      //table[ancestor::*[1][self::p] or ancestor::*[2][self::p]]
               /tr/td//a[ancestor::*[1][self::td] or ancestor::*[2][self::td]]

This selects all `a` elements whose parent or grandparent is a td, whose parent is a tr, whose parent is a table, whose parent or grandparent is a p that has fewer than 3 ancestor elements.
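For comparison, the simpler unbounded expression `//p//table//a` can be tried with nothing but Python's standard library. This is only a sketch under assumptions: the sample markup and the function name `links_under_p_table` are invented for illustration, and `xml.etree.ElementTree` supports just a subset of XPath (the `ancestor::` axis used in the bounded expression is not available there, so only the unbounded form is shown).

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup (real HTML may need an HTML parser).
SAMPLE = """
<html><body>
  <p><table><tr><td><a href="/exact">exact</a></td></tr></table></p>
  <p><div><table><tr><td><a href="/one-extra">one extra</a></td></tr></table></div></p>
  <p><div><span><span><table><tr><td><b><a href="/four-extra">four extra</a></b></td></tr></table></span></span></div></p>
  <div><a href="/menu">menu</a></div>
</body></html>
""".strip()


def links_under_p_table(root):
    """Return every <a> that sits under a <table> under a <p>, at any depth."""
    # ElementTree paths are relative, hence the leading "." before "//".
    found = root.findall(".//p//table//a")
    # Nested <p> or <table> elements could yield the same <a> more than once,
    # so drop repeats while keeping the original order.
    return list(dict.fromkeys(found))


root = ET.fromstring(SAMPLE)
print([a.get("href") for a in links_under_p_table(root)])
```

On this sample the unbounded form accepts all three table links, however deeply they are nested, and skips only the `/menu` link. That is the behaviour the bounded expression is meant to tighten.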

Dimitre Novatchev