1

I need to scrape data from elements that are deeply nested inside repeated article tags on a page. Each article has an XPath of the form:

//*[@id="post_page"]/div/div[2]/main/div/div/div/div[2]/div[2]/div/div[3]/div/article[N]

where N is the index of the article. Within each article, the XPath of the element I'm interested in is:

/div/div/div/div/div/div/div[3]/div[1]/button[1]/span

The first thing I did was to use

Elements = driver.find_elements(By.XPATH, <first_path>) 

This returned all the articles on the page. (I left out [N] because that would fetch only one specific article, and I'm interested in all of them.)
Then, for each element in the list, I called find_element with the second path as follows:

for elem in Elements:
    Required.append(elem.find_element(By.XPATH, <second_path>))

Here Required is the list in which I store the data. This is where I got the "element does not exist" error. I also tried adding a . before <second_path>, but that didn't solve the issue either.
The complete XPath of the element is:

//*[@id="post_page"]/div/div[2]/main/div/div/div/div[2]/div[2]/div/div[3]/div/article[N]/div/div/div/div/div/div/div[3]/div[1]/button[1]/span

And the CSS Selector for the same is:

#post_page > div > div._UuSG.w77Za._21rSD._3SBW4 > main > div > div > div > div._UuSG._ayWa._3dGg1.Vlb1o._1vyTb > div._UuSG.qzupC._3cqkW > div > div:nth-child(3) > div > article:nth-child(N) > div > div > div > div > div > div > div._UuSG._3VzCT._2FoTG > div._UuSG._3dGg1._2VJFi._2h1-g > button:nth-child(1) > span

I also tried a loop that increments a counter and substitutes it for N in the complete XPath, but that gave the same error.
Any help would be greatly appreciated.


EDIT[1] The last span has the following class names:

<span class="_UuSG _3_54N a8-QN _2cSLK L4pn5 RiX17">Stuff I need</span>

Taken together, these class names are unique on the page, which might be relevant.

Binayak
  • Is there an actual URL we can work with, and the values of some of the items to retrieve? Also, any other relevant code to generate the HTML as you are working with it, including import statements. – QHarr Jun 12 '21 at 03:09
  • @QHarr Sorry, I can't share a url here. And regarding the values, I can only tell this much that matching the text content with some values will not work as some of the strings in the expected result are repeated heavily in the page (and they don't belong to the required element either) – Binayak Jun 12 '21 at 04:04

2 Answers

1

I think I know your problem. When you do

Elements = driver.find_elements(By.XPATH, <first_path>) 

you have already found all the elements you need. So in your for loop, just use elem; no more "finding" is needed.

for elem in Elements:
    Required.append(elem)
C. Peck
  • I do have all the elements, but as you can see from `<second_path>`, each element has a highly nested `div` structure and I can't work with the entire thing. I specifically need to tap into the `span` at the end of it. – Binayak Jun 12 '21 at 02:51
  • I see. We'll need to see your whole HTML in order to figure this out. There is almost certainly a better way to locate those elements than the full xpath. – C. Peck Jun 12 '21 at 02:55
  • That's why I left the CSS Selector. Maybe there's a way to fetch it using the class name which I am not aware of. But I know that the classes for the last `span` are unique to it. – Binayak Jun 12 '21 at 03:06
  • If that span element ALWAYS has those classes, and it's unique, you could use `driver.find_element_by_css_selector('._UuSG._3_54N.a8-QN._2cSLK.L4pn5.RiX17')` – C. Peck Jun 12 '21 at 19:50
  • Actually, I think I was wrong: the classes are unique to the span, but some other elements use them too, in addition to other classes. For example, the span I want has the classes `.a .b .c` and some other element has the classes `.a .b .c .d .e`, so the approach you suggested fetches extra elements too. Is there any way to filter them out? By the way, I currently have a workaround: calling `.get_attribute("innerText")` on all the articles in the page and then formatting the text result. It's not the best way, I know, but it works for now. – Binayak Jun 13 '21 at 12:55
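
One possible way to filter out the extra matches described in the last comment is to compare each candidate's full class list against the wanted set, so that an element carrying additional classes is rejected. This is only a sketch: the helper name has_exact_classes and the WANTED list are illustrative, the class names are copied from the span shown in the question's EDIT, and the Selenium usage is left as a comment because it depends on a live page.

```python
def has_exact_classes(class_attr, wanted):
    """Return True when the class attribute holds exactly the wanted names.

    Order and repeated whitespace are ignored; extra classes cause a mismatch.
    """
    return set(class_attr.split()) == set(wanted)


# Class names taken from the span in the question's EDIT.
WANTED = ["_UuSG", "_3_54N", "a8-QN", "_2cSLK", "L4pn5", "RiX17"]

exact = has_exact_classes("_UuSG _3_54N a8-QN _2cSLK L4pn5 RiX17", WANTED)
superset = has_exact_classes("_UuSG _3_54N a8-QN _2cSLK L4pn5 RiX17 other", WANTED)

# Possible Selenium usage (assumes an existing `driver` and
# `from selenium.webdriver.common.by import By`; not run here):
#   candidates = driver.find_elements(
#       By.CSS_SELECTOR, "span._UuSG._3_54N.a8-QN._2cSLK.L4pn5.RiX17")
#   spans = [e for e in candidates
#            if has_exact_classes(e.get_attribute("class") or "", WANTED)]
```

The CSS selector alone matches any span that has at least those classes, which is why the second pass over the class attribute is needed.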
0

I would use .// to select along the descendant-or-self axis starting from the current node (. means the current node). You have already tried ./, which is pretty close.

xpath ".//span", what does the dot mean?
What is meaning of .// in XPath?
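
The difference between ./ and .// can be checked offline with the standard library's ElementTree, which supports a small subset of XPath. The markup below is a hypothetical, heavily simplified stand-in for the page in the question, and the rough Selenium equivalent appears only as a comment, so treat this as a sketch rather than a drop-in fix.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page: two articles with different nesting.
PAGE = """
<main>
  <article><div><div><button><span>first</span></button></div></div></article>
  <article><div><button><span>second</span></button></div></article>
</main>
"""

root = ET.fromstring(PAGE)
articles = root.findall(".//article")

# "./div/div/button/span" walks a fixed chain of children from each article,
# so it only matches when the nesting is exactly as written.
fixed_depth = [a.find("./div/div/button/span") is not None for a in articles]

# ".//span" also starts at each article, but matches a span at any depth.
required = [a.find(".//span").text for a in articles]

print(fixed_depth)
print(required)

# Rough Selenium equivalent for one article element (not run here):
#   elem.find_element(By.XPATH, ".//span").text
```

In Selenium, a path that starts with / or // is evaluated from the document root even when it is called on an element, which is why the leading dot matters.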

K. B.