I'm struggling with this simple code.

<div id="post_message_975824" class="alt3">
   <div class="quote">
      some unwanted text 
   </div>
   the text to get <abr>ABR</abr> text to get
</div>

and I want to get this query to work:

xpath = "//*[contains(@id, 'post_message_') and not(contains(@class,'quote'))]"

but this fails. I tried some other queries, but I'm not sure what I'm doing wrong.

EDIT

I found that this code works: xpath = "//*[contains(@id,'post_message_')]//div[not(contains(@class,'quote'))]"

but it doesn't select the desired text when there's no nested quote div in the HTML.

The idea is to get all of the text, including the text of all subnodes, except for the restricted ones.

Peter.k

2 Answers

Try this XPath:

//div[contains(@id,'post_message_')]/text() | //div[contains(@id,'post_message_')]/*[not(contains(@class,'quote'))]/text()

The first part of the XPath, //div[contains(@id,'post_message_')]/text(), gives the text directly under the parent div, i.e. <div id="post_message_975824" class="alt3">

The second part, //div[contains(@id,'post_message_')]/*[not(contains(@class,'quote'))]/text(), gives the text directly under each child element whose class attribute does not contain quote (children with no class attribute are included)

The result on your example is:

   the text to get 
ABR
 text to get
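
The union above only reaches text one level below the matched div. If the text can be nested more deeply, one possible variant (a sketch, not part of the original answer) is to select every descendant text node and drop those that have an ancestor whose class contains quote. The example below assumes R with the xml2 package and reuses the HTML from the question; the names xp, nodes and wanted are only illustrative.

```r
library(xml2)

doc <- read_xml('<div id="post_message_975824" class="alt3">
   <div class="quote">
      some unwanted text
   </div>
   the text to get <abr>ABR</abr> text to get
</div>')

# Every descendant text node of the post_message div, at any depth,
# except those with an ancestor whose class attribute contains "quote".
xp <- "//div[contains(@id,'post_message_')]//text()[not(ancestor::*[contains(@class,'quote')])]"
nodes <- xml_find_all(doc, xp)

# Concatenate the pieces in document order and trim the outer whitespace.
wanted <- trimws(paste(xml_text(nodes), collapse = ""))
print(wanted)
```

Because the exclusion is a predicate on the text nodes themselves, the same expression should also return the text when the post contains no quote div at all.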
SomeDude

Why not just remove all the nodes you don't want?

library(xml2)

doc <- read_xml('<div id="post_message_975824" class="alt3">
   <div class="quote">
      some unwanted text 
   </div>
   the text to get <abr>ABR</abr> text to get
</div>')

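# Remove the unwanted nodes in place (no pipe, so only xml2 is needed)
xml_remove(xml_find_all(doc, ".//div[@class='quote']"))

# The remaining document text no longer includes the quote
trimws(xml_text(doc))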
hrbrmstr