
I am having a problem scraping a table-heavy page with DOMXPath.

The layout is really ugly: I am trying to get content out of a table within a table within a table. For the element I want, Firebug's FirePath gives me the following path:

 html/body/table/tbody/tr[3]/td/table[1]/tbody/tr[2]/td[1]/table[1]/tbody/tr[3]/td[4]

After endless experimenting I found out that, for a stand-alone table, I need to remove the "tbody" steps from the path to make it work. But that does not seem to be enough for tables within tables. So my question is: how do I best get content out of tables within tables within tables?

I uploaded the file which I am trying to scrape here:1

Paul
  • Work out the path down to the desired elements yourself. Don't trust Firebug as (as you've seen) it doesn't reflect the original document precisely. We can't help you much without seeing the "really ugly" HTML. – salathe Dec 13 '12 at 23:09
  • @salathe I tried working on it some more but just can't get it to work. I uploaded it now to http://www.pjh.org/se/XpathProblemFile.zip - maybe you can give it quick look. – Paul Dec 15 '12 at 22:46
  • I guess now we need to know what exactly you're trying to scrape out of that file. It might be worth looking at other, neater, ways of getting at the content that you need; for example, do the table cells have specific "class" attributes that you can target, or some pattern to their content, rather than just trying to use something like the Firebug path. – salathe Dec 15 '12 at 23:10
  • @salathe I'm trying to get GRABME1, GRABME2 and GRABME3. The problem I see is, that the same class attributes are appearing multiple times, i.e. if I get the Xpath right it might be easier to adjust. – Paul Dec 16 '12 at 10:33

2 Answers


I have been through the same problem as you: scraping complicated, badly formatted HTML where I wanted to get the values from a table nested inside other tables.

My approach was to narrow in on the part I wanted with a series of functions like this:

function parse_html(DOMXPath $xpath) { // locates the rows of the table I chose to extract
    $ids = $xpath->query('//tr[@data-eventid]/@data-eventid'); // attribute nodes identifying the rows I want
    foreach ($ids as $id) {
        parse_table($xpath, $id->nodeValue);
    }
}
function parse_table(DOMXPath $xpath, $eventId) { // extracts the content of one row, e.g. $eventId = "405412"
    $titles = $xpath->query('//tr[@data-eventid="' . $eventId . '"]/td[@class="impact"]/span[@title]/@title');
    // ...etc: more queries for the other cells
    parseEvaluate($titles);
}
function parseEvaluate($values) {
    // ...verify that the values are correct
}

This is just to give the idea.
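To make the idea concrete, here is a minimal, self-contained sketch of the same attribute-driven approach. The markup is a made-up stand-in (the `data-eventid` rows and `impact` cells only mirror the queries above, not the asker's page), and the variable names are my own. It assumes PHP with the DOM extension enabled.

```php
<?php
// Hypothetical markup: a table nested inside another table, with the
// interesting rows tagged by a data-eventid attribute.
$html = <<<'EOT'
<html><body>
<table><tr><td>
  <table>
    <tr data-eventid="405412"><td class="impact"><span title="High"></span></td></tr>
    <tr data-eventid="405413"><td class="impact"><span title="Low"></span></td></tr>
    <tr><td class="impact"><span title="ignored"></span></td></tr>
  </table>
</td></tr></table>
</body></html>
EOT;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings about messy HTML quiet
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$impactByEvent = [];
// Select rows by attribute, wherever they sit in the nesting.
foreach ($xpath->query('//tr[@data-eventid]') as $row) {
    $id = $row->getAttribute('data-eventid');
    // Relative query, evaluated with the current row as context node.
    $impactByEvent[$id] = $xpath->evaluate('string(td[@class="impact"]/span/@title)', $row);
}

print_r($impactByEvent);
```

Because the rows are selected by attribute rather than by position, the expression does not care how many tables wrap them, and the question of "tbody" steps never comes up.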

Vainglory07

How about:

//*[contains(text(),"GRABME")]

I know that's probably not exactly what you want, but you get the idea: identify a pattern and use that pattern to construct the XPath.
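As a rough sketch of how that expression behaves on nested tables, the snippet below runs it against a small made-up document. The markup is only an assumption about what the uploaded page might look like (three levels of tables, repeated class attributes, cells containing GRABME1 to GRABME3); it is not the actual file.

```php
<?php
// Hypothetical stand-in for the uploaded page: three levels of nested tables,
// with the same class attribute repeated on every cell.
$html = <<<'EOT'
<html><body>
<table><tr><td>
  <table><tr><td>header</td></tr><tr><td>
    <table>
      <tr><td class="x">other</td><td class="x">GRABME1</td></tr>
      <tr><td class="x">GRABME2</td><td class="x">GRABME3</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
</body></html>
EOT;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings about messy HTML quiet
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$found = [];
// Match on content instead of on position in the table hierarchy.
foreach ($xpath->query('//*[contains(text(),"GRABME")]') as $node) {
    $found[] = trim($node->nodeValue);
}

print_r($found);
```

One caveat: in XPath 1.0, `contains(text(), ...)` only tests the first text node of each element. If a cell mixes text with inline markup, `//td[text()[contains(., "GRABME")]]` checks every text node instead.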

pguardiario