
I'm using Python 3.6 to process a chunk of HTML. The for loop below runs, but the atag.xpath query searches the whole HTML source and returns all four data-size values on every iteration.

What I want is this: when PAGE_RAW is processed in the for loop, each DIV with a class of item should yield only the data-size attribute of its own child DIV with a class of padding, not the attributes of every matching tag it finds in the HTML source.

HTML

<div class="item">
    <div class="padding" data-size="12"></div>
</div>
<div class="item">
    <div class="padding" data-size="13"></div>
</div>
<div class="item">
    <div class="padding" data-size="14"></div>
</div>
<div class="item">
    <div class="padding" data-size="15"></div>
</div>

Code

import lxml.html as LH
...

PAGE_RAW = driver.page_source
PAGE_RAW = LH.fromstring(PAGE_RAW)

for atag in PAGE_RAW.xpath("//div[contains(@class, 'item')]"):
    data = atag.xpath("//div[contains(@class, 'padding')]/@data-size")
llanato

1 Answer


The problem is in your second XPath: the leading // tells it to search anywhere in the document. It doesn't matter that the current node is a specific div; an expression that starts with // always searches from the document root.

To find only the nodes under the current node, replace // with .// (the . means the search starts at the current node, not the root).

import lxml.html as LH
...

PAGE_RAW = driver.page_source
PAGE_RAW = LH.fromstring(PAGE_RAW)

# The leading "." makes the inner query relative to the current item div.
for atag in PAGE_RAW.xpath("//div[contains(@class, 'item')]"):
    data = atag.xpath(".//div[contains(@class, 'padding')]/@data-size")
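
To see the difference side by side, here is a minimal self-contained sketch. It assumes the HTML from the question is parsed from a string literal instead of driver.page_source; the names HTML, root, items, absolute and relative are made up for this illustration.

```python
import lxml.html as LH

# Assumption: a static copy of the question's markup stands in for
# driver.page_source so the example runs without a browser.
HTML = """
<html><body>
<div class="item"><div class="padding" data-size="12"></div></div>
<div class="item"><div class="padding" data-size="13"></div></div>
<div class="item"><div class="padding" data-size="14"></div></div>
<div class="item"><div class="padding" data-size="15"></div></div>
</body></html>
"""

root = LH.fromstring(HTML)
items = root.xpath("//div[contains(@class, 'item')]")

# "//" is absolute: every item returns all four values.
absolute = [
    item.xpath("//div[contains(@class, 'padding')]/@data-size")
    for item in items
]

# ".//" is relative: every item returns only its own descendant's value.
relative = [
    item.xpath(".//div[contains(@class, 'padding')]/@data-size")
    for item in items
]

print(absolute[0])  # ['12', '13', '14', '15']
print(relative)     # [['12'], ['13'], ['14'], ['15']]
```

Note that xpath() always returns a list, so if you expect exactly one padding div per item you would take the first element of each result.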
araraonline