6

Similar to "How to deal with single quote in XPath", I want to escape single quotes. The difference is that I can't rule out that a double quote might also appear in the target string.

Goal:

Escape double and single quotes simultaneously in XPath (in R). The target string should be passed as a variable, not hard-coded as in one of the existing answers, because I don't know its content beforehand: it could contain single quotes, double quotes, or both.

Works:

library(rvest)
library(magrittr)
html <- "<div>1</div><div>Father's son</div>"
target <- "Father's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"", target,"\")]"))
{xml_nodeset (1)}
[1] <div>Father's son</div>

Does not work:

html <- "<div>1</div><div>Fat\"her's son</div>"
target <- "Fat\"her's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"", target,"\")]"))
{xml_nodeset (0)}
Warning message:
In xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
  Invalid expression [1207]
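
Printing the expression that paste0() builds makes the failure visible. This is a minimal sketch in base R that reuses the target above: the double quote inside the target closes the XPath string literal early, and XPath 1.0 string literals have no backslash escape that could prevent this.

```r
# Rebuild the failing expression exactly as in the call above
target <- "Fat\"her's son"
xpath  <- paste0("//*[contains(text(), \"", target, "\")]")

# writeLines() shows the raw string that is handed to the XPath parser
writeLines(xpath)
#> //*[contains(text(), "Fat"her's son")]
```

The literal ends after `Fat`, and the rest (`her's son")`) is no longer valid XPath.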

Update

Non-R answers that I could try to "translate to R" are very welcome.

halfer
Tlatwork

4 Answers

7

The key here is realising that with xml2 you can write HTML-escaped characters back into the parsed document. This function will do the trick. It's longer than it needs to be because it includes comments and some type-checking and conversion logic.

contains_text <- function(node_set, find_this)
{
  # Ensure we have a nodeset
  if(all(class(node_set) == c("xml_document", "xml_node")))
    node_set %<>% xml_children()

  if(class(node_set) != "xml_nodeset")
    stop("contains_text requires an xml_nodeset or xml_document.")

  # Get all leaf nodes
  node_set %<>% xml_nodes(xpath = "//*[not(*)]")

  # HTML escape the target string
  find_this %<>% {gsub("\"", "&quot;", .)}

  # Extract, HTML escape and replace the nodes
  lapply(node_set, function(node) xml_text(node) %<>% {gsub("\"", "&quot;", .)})

  # Now we can define the xpath and extract our target nodes
  xpath <- paste0("//*[contains(text(), \"", find_this, "\")]")
  new_nodes <- html_nodes(node_set, xpath = xpath)

  # Since the underlying xml_document is passed by pointer internally,
  # we should unescape any text to leave it unaltered
  xml_text(node_set) %<>% {gsub("&quot;", "\"", .)}
  return(new_nodes)
}

Now:

library(rvest)
library(xml2)
library(magrittr)  # provides the %<>% operator used in contains_text()

html %>% xml2::read_html() %>% contains_text(target)
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>
html %>% xml2::read_html() %>% contains_text(target) %>% xml_text()
#> [1] "Fat\"her's son"

ADDENDUM

This is an alternative method: an implementation of the approach suggested by @Alejandro that allows arbitrary targets. It has the merit of leaving the XML document untouched and is a little faster than the method above, but it involves the kind of string parsing that an XML library is supposed to prevent. It works by splitting the target after each " and ', enclosing each fragment in the opposite type of quote to the one it contains, then pasting the fragments back together with commas and inserting them into an XPath concat() call.

library(stringr)
library(magrittr)  # for %<>% (stringr only re-exports %>%)

safe_xpath <- function(target)
{
  target                                 %<>%
  str_replace_all("\"", "&quot;&break;") %>%
  str_replace_all("'", "&apo;&break;")   %>%
  str_split("&break;")                   %>%
  unlist()

  safe_pieces    <- grep("(&quot;)|(&apo;)", target, invert = TRUE)
  contain_quotes <- grep("&quot;", target)
  contain_apo    <- grep("&apo;", target)

  if(length(safe_pieces) > 0)
    target[safe_pieces] <- paste0("\"", target[safe_pieces], "\"")

  if(length(contain_quotes) > 0)
  {
    target[contain_quotes] <- paste0("'", target[contain_quotes], "'")
    target[contain_quotes] <- gsub("&quot;", "\"", target[contain_quotes])
  }

  if(length(contain_apo) > 0)
  {
    target[contain_apo] <- paste0("\"", target[contain_apo], "\"")
    target[contain_apo] <- gsub("&apo;", "'", target[contain_apo])
  }

  fragment <- paste0(target, collapse = ",")
  return(paste0("//*[contains(text(),concat(", fragment, "))]"))
}

Now we can generate a valid xpath like this:

safe_xpath(target)
#> [1] "//*[contains(text(),concat('Fat\"',\"her'\",\"s son\"))]"

so that

html %>% xml2::read_html() %>% html_nodes(xpath = safe_xpath(target))
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>
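
One caveat, as far as I can tell from reading the function rather than from the answer itself: XPath 1.0's concat() requires at least two arguments, so a target without any quotes would end up as a single-argument concat("...") call. A minimal sketch of one way to guard against that follows; `join_concat` is a hypothetical helper, not part of the answer, and it assumes the pieces are already quoted the way safe_xpath() quotes them.

```r
# Hypothetical helper: join already-quoted pieces into a concat() call,
# padding with an empty string literal so concat() always gets >= 2 arguments.
join_concat <- function(pieces) {
  stopifnot(is.character(pieces), length(pieces) >= 1L)
  if (length(pieces) < 2L) pieces <- c(pieces, '""')
  paste0("concat(", paste0(pieces, collapse = ","), ")")
}

writeLines(join_concat('"Father"'))
#> concat("Father","")
writeLines(join_concat(c("'Fat\"'", "\"her'\"", "\"s son\"")))
#> concat('Fat"',"her'","s son")
```

Appending an empty literal does not change the concatenated string, so the match itself is unaffected.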
Allan Cameron
  • This approach changes the underlying document instead of composing a correct XPath expression. – Alejandro Jan 03 '20 at 20:24
  • @Alejandro I know what you mean, but bear in mind that the XML is returned to its initial state before the function ends, so this implementation detail is hidden from users. We are not in a multi-threaded environment where this kind of implementation could be problematic. I have also written a function (similar to your suggested method) that builds an XPath in pieces, but to my mind it is less elegant. It would be almost as easy to just parse the HTML as a single character string if you were going to do this. If ThanksGuys is interested I could include it in my answer. – Allan Cameron Jan 03 '20 at 23:22
  • I would be interested for sure, if it doesn't require too much effort. But to be fair, my spec didn't place any restrictions on (temporary or persistent) changes to the underlying document, so the question is fully answered. In fact, I find the idea of a temporary change to the underlying document with xml2 quite clever. But I will keep Alejandro's hint in mind! – Tlatwork Jan 04 '20 at 16:26
  • Now the second part of the answer covers the common approach for injecting a string into an embedded language: sanitizing the string with the host language. – Alejandro Jan 06 '20 at 13:58
6

Because you are using string manipulation to build your XPath expression, it's your responsibility to make sure the result is valid XPath. This expression:

//*[contains(.,concat('Fat"',"her's son"))]

Selects:

<div>Fat"her's son</div>

It would be a better approach to use an XPath string variable, but it looks like R doesn't have an API for that, even using libxml.
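
In the absence of such an API, the quoting can be done in R before the expression is assembled. The following is a minimal base-R sketch, not a definitive implementation: `xpath_literal` is a hypothetical helper name, and it assumes a single non-missing string as input. It wraps the string in whichever quote type it does not contain, and falls back to concat() when it contains both.

```r
# Hypothetical helper: turn an arbitrary R string into an XPath 1.0
# string expression that evaluates to exactly that string.
xpath_literal <- function(s) {
  stopifnot(is.character(s), length(s) == 1L, !is.na(s))
  has_double <- grepl('"', s, fixed = TRUE)
  has_single <- grepl("'", s, fixed = TRUE)
  if (!has_double) return(paste0('"', s, '"'))  # safe inside double quotes
  if (!has_single) return(paste0("'", s, "'"))  # safe inside single quotes
  # Both kinds present: replace every " with ",'"'," and wrap the result,
  # which yields concat("...", '"', "...") with at least three arguments.
  paste0('concat("', gsub('"', '",\'"\',"', s, fixed = TRUE), '")')
}

target <- "Fat\"her's son"
xpath  <- paste0("//*[contains(text(), ", xpath_literal(target), ")]")
writeLines(xpath)
#> //*[contains(text(), concat("Fat",'"',"her's son"))]
```

The resulting string can then be passed as the `xpath` argument in place of the hand-built expression from the question.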

Alejandro
  • @ThanksGuys No problem. But that answer is wrong in essence. You should compose a syntactically correct XPath expression, the same way you need a syntactically correct R program. For that you need an auxiliary R function that leaves a string without quotes unaltered, wraps it in the opposite kind of quote if it contains only single or only double quotes, or recursively applies itself to the parts tokenized by quote characters when the string has both single and double quotes. – Alejandro Jan 03 '20 at 20:35
4

Use quote() for the XPath query

library(XML)

Only a single quote inside the string

target1 <- "Father's son"
doc1 <- XML::newHTMLDoc()
newXMLNode("div", 1, parent = getNodeSet(doc1, "//body"), doc = doc1)
newXMLNode("div", target1, parent = getNodeSet(doc1, "//body"), doc = doc1)
xpath_query1 <- paste0('//*[ contains(text(), ', '"', target1, '"', ')]')
getNodeSet(doc1, xpath_query1)

Both single and double quotes inside the string

target2 <- "Fat\"her's son"
doc2 <- XML::newHTMLDoc()
newXMLNode("div", 1, parent = getNodeSet(doc2, "//body"), doc = doc2)
newXMLNode("div", target2, parent = getNodeSet(doc2, "//body"), doc = doc2)
xpath_query2 <- quote('//body/*[contains(.,concat(\'Fat"\',"her\'s son"))]')
getNodeSet(doc2, xpath_query2)

Output:

getNodeSet(doc1, xpath_query1)
# [[1]]
# <div>Father's son</div> 
# 
# attr(,"class")
# [1] "XMLNodeSet"

getNodeSet(doc2, xpath_query2)
# [[1]]
# <div>Fat"her's son</div> 
# 
# attr(,"class")
# [1] "XMLNodeSet"
Sathish
  • Thank you, that's already a great help. Maybe I did not specify it well enough: I need to insert `target` dynamically. So something along the lines of `xpath_query2 <- quote(paste0('//body/*[contains(.,concat(', target,'))]'))` (this sample code obviously fails). Would something like that be possible? – Tlatwork Dec 17 '19 at 10:35
  • Yes, it is possible. Note the idea inside XPath query 2: single quotes go inside double quotes and double quotes go inside single quotes. They are then concatenated using the XPath concat() function. You could create the XPath query dynamically by identifying single and double quotes inside a string and handling them appropriately. You just write a generic function implementing this idea. Hope this helps. – Sathish Dec 17 '19 at 16:00
  • The problem with XPath queries, as far as I understand, is that they do not support escaping a double quote. You always get an error when you try to escape a double quote inside an XPath query. – Sathish Dec 17 '19 at 16:08
0

I wrapped the target in cat() inside the html_nodes() call. This seems to handle both cases. cat() also has the side effect of printing the escaped text.

library(rvest)
library(magrittr)

html <- "<div>1</div><div>Father's son</div>"
target <- "Father's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"",cat(target),"\")]"))
#> Father's son
#> {xml_nodeset (4)}
#> [1] <html><body>\n<div>1</div>\n<div>Father's son</div>\n</body></html>
#> [2] <body>\n<div>1</div>\n<div>Father's son</div>\n</body>
#> [3] <div>1</div>\n
#> [4] <div>Father's son</div>

html <- "<div>1</div><div>Father said \"Hello!\"</div>"
target <- 'Father said "Hello!"'
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"",cat(target),"\")]"))
#> Father said "Hello!"
#> {xml_nodeset (4)}
#> [1] <html><body>\n<div>1</div>\n<div>Father said "Hello!"</div>\n</body> ...
#> [2] <body>\n<div>1</div>\n<div>Father said "Hello!"</div>\n</body>
#> [3] <div>1</div>\n
#> [4] <div>Father said "Hello!"</div>
Vishal Katti
  • Thank you for your answer. It seems the output contains 4 nodes, not one, so all nodes are selected. I think the part of the XPath inside cat() is just omitted; at least it looks that way if you save it to a variable. – Tlatwork Jan 02 '20 at 14:27
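
The observation in the comment can be checked in base R. A minimal sketch, reusing the first target from this answer: cat() prints its argument and returns NULL, and paste0() drops zero-length arguments, so the expression that reaches the XPath engine searches for the empty string, which every element contains.

```r
target <- "Father's son"

# cat() writes the text to the console as a side effect and returns NULL
value <- cat(target, "\n")
is.null(value)
#> [1] TRUE

# paste0() skips the NULL, so the target never enters the expression
xpath <- paste0("//*[contains(text(), \"", value, "\")]")
writeLines(xpath)
#> //*[contains(text(), "")]
```

That is why all four elements (html, body, and both divs) are returned for either target.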