
Hi, we are running this code and want to loop over the rows of a table. I have misunderstood something about how Python handles XPath selectors. The expressions work in my Chrome XPath tools, just not in Python.

  • we capture a data table in table (this works)
  • then we grab all the underlying rows (tr)

My question is: how can I select the tbody/tr rows and the color in the proper, most logical way? I have tried //, ./ and / ...

  1. For color_rows = table.xpath('/tbody/tr') I would expect to be able to use /tbody/tr directly because the data is directly under the table. Somehow I have to use // to get it to work, why?

  2. For color = color_row.xpath('/td[1]/b/text()').get().strip() I would expect to be able to use /td[1]/b/text() directly because the data is directly under the tr. Somehow I have to use // to get it to work, why?

     table = response.xpath('//div[@class="content"]//table[contains(@class,"table")]')
     color_rows = table.xpath('/tbody/tr')
     for color_row in color_rows:
         color = color_row.xpath('/td[1]/b/text()').get().strip()
    

Our html data looks like this

<table class="table">
    <thead>
        <tr>
            <th id="ctl00_cphCEShop_colColore" class="text-left" colspan="2">Colore</th>
                <th>S</th>
                <th>M</th>
                <th>L</th>
            </tr>
    </thead>
    
    <tbody>
        <tr>
            <td id="x">
                <b>White</b>
                <input type="hidden" name="data" value="3230/201">
            </td>
            <td id="avail">
                Avail:
            </td>
            <td id="1">
                <div>
                    <input name="cell" type="text" class="form-control">
                    <div class="text-center">179</div>
                </div>
            </td>
            <td id="2">
                <div>
                    <input name="cell" type="text" class="form-control">
                    <div class="text-center">360</div>
                </div>
            </td>
etc etc
snh_nl
  • Maybe this answer (https://stackoverflow.com/questions/18241029/why-does-my-xpath-query-scraping-html-tables-only-work-in-firebug-but-not-the) can provide some help. tbody tags are often generated by browsers while being absent from the original HTML source, and crawlers read the latter. – Stefano Fiorucci - anakin87 Nov 05 '20 at 11:26

1 Answer


To locate a node anywhere in the document you need to use // (the root node is the exception: a single slash is enough, as in '/html'):

response.xpath('//tbody/tr')

To start the search from a node you have already found, use ./ for children or .// for descendants:

table.xpath('./tbody/tr') 

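Note the dot at the beginning of the XPath expression: it refers to the context node. Without it, an expression starting with / or // is evaluated from the document root, no matter which node you call it on.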

P.S. I would also recommend not relying on tbody in your XPath: it may be absent from the page source and only added by the browser when it builds the DOM. Always inspect the raw HTML, not just the rendered DOM.

JaSON
  • Could tbody be the culprit then? And why am I suddenly so uncertain about my XPath skills? – snh_nl Nov 05 '20 at 12:29
  • Can you elaborate a little more on context node? for `color_row` in the loop is the context "in TR" or "under TR"? For `tbody` and `td` would I then use `./tbody` and `./td` if the context is correct? I would rather refrain from using `//` – snh_nl Nov 05 '20 at 12:32
  • @snh_nl, in `table.xpath('/tbody/tr')` the context node is the `<table>` that you've declared as the `table` variable. So it'd be better to use `table.xpath('./tr')`. On each iteration the context node is the `<tr>` (`color_row`). Use `color = color_row.xpath('./td[1]/b/text()').get().strip()`. Some explanations about the context node can be found [here](https://stackoverflow.com/questions/1022345/current-node-vs-context-node-in-xslt-xpath)
    – JaSON Nov 05 '20 at 12:36
  • There was no tbody in the underlying HTML. The browser added it. And because I have only just started with Python and XPath, I am still a little wobbly on the syntax ... and this threw me off. – snh_nl Nov 05 '20 at 20:26