Simultaneously escape double and single quotes in Xpath

Question

Similar to "How to deal with single quote in xpath", I want to escape single quotes. The difference is that I can't rule out that a double quote might also appear in the target string.

Goal:

Escape double and single quotes simultaneously with XPath (in R). The target should be passed as a variable and not be hard-coded as in one of the existing answers, because I don't know its content beforehand: it could contain single quotes, double quotes, or both.

Works:

library(rvest)
library(magrittr)
html <- "<div>1</div><div>Father's son</div>"
target <- "Father's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"", target,"\")]"))
{xml_nodeset (1)}
[1] <div>Father's son</div>

Does not work:

html <- "<div>1</div><div>Fat\"her's son</div>"
target <- "Fat\"her's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"", target,"\")]"))
{xml_nodeset (0)}
Warning message:
In xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
  Invalid expression [1207]

Update

Non-R answers that I could try to "translate to R" are very welcome.

Allan Cameron · Accepted Answer · 2020-01-04T18:48:15.717

The key here is realising that with xml2 you can write back into the parsed html with html-escaped characters. This function will do the trick. It's longer than it needs to be because I've included comments and some type checking / converting logic.

contains_text <- function(node_set, find_this)
{
  # Ensure we have a nodeset
  if(all(class(node_set) == c("xml_document", "xml_node")))
    node_set %<>% xml_children()

  if(class(node_set) != "xml_nodeset")
    stop("contains_text requires an xml_nodeset or xml_document.")

  # Get all leaf nodes
  node_set %<>% html_nodes(xpath = "//*[not(*)]")

  # HTML escape the target string
  find_this %<>% {gsub("\"", "&quot;", .)}

  # Extract, HTML escape and replace the nodes
  lapply(node_set, function(node) xml_text(node) %<>% {gsub("\"", "&quot;", .)})

  # Now we can define the xpath and extract our target nodes
  xpath <- paste0("//*[contains(text(), \"", find_this, "\")]")
  new_nodes <- html_nodes(node_set, xpath = xpath)

  # Since the underlying xml_document is passed by pointer internally,
  # we should unescape any text to leave it unaltered
  xml_text(node_set) %<>% {gsub("&quot;", "\"", .)}
  return(new_nodes)
}

Now:

library(rvest)
library(xml2)
library(magrittr)  # contains_text() uses the %<>% assignment pipe

html %>% xml2::read_html() %>% contains_text(target)
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>
html %>% xml2::read_html() %>% contains_text(target) %>% xml_text()
#> [1] "Fat\"her's son"

ADDENDUM

This is an alternative method, which is an implementation of the method suggested by @Alejandro but allows arbitrary targets. It has the merit of leaving the xml document untouched, and is a little faster than the above method, but involves the kind of string parsing that an xml library is supposed to prevent. It works by taking the target, splitting it after each " and ', then enclosing each fragment in the opposite type of quote to the one it contains, before pasting the fragments back together with commas and inserting them into the XPath concat() function.

library(stringr)
library(magrittr)  # for the %<>% assignment pipe used below

safe_xpath <- function(target)
{
  target                                 %<>%
  str_replace_all("\"", "&quot;&break;") %>%
  str_replace_all("'", "&apo;&break;")   %>%
  str_split("&break;")                   %>%
  unlist()

  safe_pieces    <- grep("(&quot;)|(&apo;)", target, invert = TRUE)
  contain_quotes <- grep("&quot;", target)
  contain_apo    <- grep("&apo;", target)

  if(length(safe_pieces) > 0) 
      target[safe_pieces] <- paste0("\"", target[safe_pieces], "\"")

  if(length(contain_quotes) > 0)
  {
    target[contain_quotes] <- paste0("'", target[contain_quotes], "'")
    target[contain_quotes] <- gsub("&quot;", "\"", target[contain_quotes])
  }

  if(length(contain_apo) > 0)
  {
    target[contain_apo] <- paste0("\"", target[contain_apo], "\"")
    target[contain_apo] <- gsub("&apo;", "'", target[contain_apo])
  }

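  # XPath 1.0 concat() needs at least two arguments, so a target without
  # any quotes (a single piece) is inserted as a plain string literal
  if(length(target) == 1)
    return(paste0("//*[contains(text(),", target, ")]"))

  fragment <- paste0(target, collapse = ",")
  return(paste0("//*[contains(text(),concat(", fragment, "))]"))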
}

Now we can generate a valid xpath like this:

safe_xpath(target)
#> [1] "//*[contains(text(),concat('Fat\"',\"her'\",\"s son\"))]"

so that

html %>% xml2::read_html() %>% html_nodes(xpath = safe_xpath(target))
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>

This approach involves changing the underlying document instead of composing a correct XPath expression. – Alejandro, Jan 03 '20 at 20:24
@Alejandro I know what you mean, but bear in mind that the xml is returned to its initial state before the end of this function, so this fact about the implementation is hidden from users. We are not in a multi-threaded environment where this kind of implementation could be problematic. I have also written a function (similar to your suggested method) that builds an xpath in pieces, but to my mind it is less elegant. It would be almost as easy to just parse the html as a single character string if you were going to do this. If ThanksGuys is interested I could include it in my answer. – Allan Cameron, Jan 03 '20 at 23:22
I would certainly be interested, if it doesn't require too much effort. But to be fair, my spec didn't place any restrictions on (temporary or persistent) changes to the underlying document, so the question is fully answered. In fact, I find the idea of a temporary change to the underlying document with xml2 quite clever. But I will keep Alejandro's hint in mind! – Tlatwork, Jan 04 '20 at 16:26
Now the second part of the answer does cover the common approach for injecting a string into an embedded language: sanitizing the string with the host language. – Alejandro, Jan 06 '20 at 13:58

score 6 · Answer 2 · answered Dec 16 '19 at 22:55

Because you are using string manipulation to build your XPath expression, it's your responsibility to ensure that the expression is valid XPath. This expression:

//*[contains(.,concat('Fat"',"her's son"))]

Selects:

<div>Fat"her's son</div>

It would be a better approach to use an XPath string variable, but it looks like R doesn't have an API for that, even using libxml.
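
As a rough check from R, the hand-written expression above can be passed to xml2 as an ordinary string. This is only a sketch: it assumes the xml2 package is installed, and it narrows the path to `//div` as a choice made for this example, because `contains(., ...)` tests the whole string value of an element, so `//*` would also match the `html` and `body` wrappers that `read_html()` adds.

```r
library(xml2)

html <- "<div>1</div><div>Fat\"her's son</div>"
doc  <- read_html(html)

# The XPath expression as an R string. The R string is delimited by single
# quotes, so the apostrophes are escaped for R only. XPath itself sees
# concat('Fat"', "her's son") with no escapes at all.
xpath <- '//div[contains(., concat(\'Fat"\', "her\'s son"))]'

hits <- xml_find_all(doc, xpath)
print(xml_text(hits))
```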

Alejandro

@ThanksGuys No problem. But that answer is wrong in essence. You should compose a syntactically correct XPath expression, just as you need a syntactically correct R program. For that you need an auxiliary R function that leaves a string without quotes unaltered, uses the opposite kind of quote if the string contains only single or only double quotes, or recursively applies itself to the parts tokenized by quote characters when the string has both single and double quotes. – Alejandro Jan 03 '20 at 20:35
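
A minimal base-R sketch of the kind of auxiliary function described in this comment. The name `xpath_literal` is hypothetical and not part of any package, and the sketch tokenizes the string in a single pass with a regular expression instead of recursing, which is a substitution made here for brevity. It returns an XPath 1.0 string expression (a quoted literal, or a `concat()` call when both quote types occur) that can be pasted into a larger query.

```r
# Hypothetical helper: turn an arbitrary R string into an XPath 1.0
# string expression. XPath 1.0 literals have no escape character, so the
# delimiter must be a quote type that does not occur inside the literal.
xpath_literal <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  has_double <- grepl('"', x, fixed = TRUE)
  has_single <- grepl("'", x, fixed = TRUE)

  # No double quotes inside: a double-quoted literal is safe
  if (!has_double) return(paste0('"', x, '"'))
  # Double quotes but no single quotes: use the opposite delimiter
  if (!has_single) return(paste0("'", x, "'"))

  # Both kinds: split into runs of double quotes and runs of anything else,
  # wrap each run in the delimiter it does not contain, and join the pieces
  # with concat(). There are always at least two runs in this branch.
  tokens <- regmatches(x, gregexpr('[^"]+|"+', x))[[1]]
  is_dq  <- startsWith(tokens, '"')
  quoted <- ifelse(is_dq, paste0("'", tokens, "'"), paste0('"', tokens, '"'))
  paste0("concat(", paste(quoted, collapse = ","), ")")
}

target <- "Fat\"her's son"
xpath  <- paste0("//*[contains(text(), ", xpath_literal(target), ")]")
cat(xpath, "\n")
```

The resulting string can then be handed to `html_nodes(xpath = ...)` or `xml2::xml_find_all()` in the same way as the queries shown in the answers.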

Sathish · Answer 3 · score 4 · 2019-12-17T00:36:43.857

Use quote() for the XPath query

library(XML)

only single quote inside string

target1 <- "Father's son"
doc1 <- XML::newHTMLDoc()
newXMLNode("div", 1, parent = getNodeSet(doc1, "//body"), doc = doc1)
newXMLNode("div", target1, parent = getNodeSet(doc1, "//body"), doc = doc1)
xpath_query1 <- paste0('//*[ contains(text(), ', '"', target1, '"', ')]')
getNodeSet(doc1, xpath_query1)

both single and double quote inside string

target2 <- "Fat\"her's son"
doc2 <- XML::newHTMLDoc()
newXMLNode("div", 1, parent = getNodeSet(doc2, "//body"), doc = doc2)
newXMLNode("div", target2, parent = getNodeSet(doc2, "//body"), doc = doc2)
xpath_query2 <- quote('//body/*[contains(.,concat(\'Fat"\',"her\'s son"))]')
getNodeSet(doc2, xpath_query2)

Output:

getNodeSet(doc1, xpath_query1)
# [[1]]
# <div>Father's son</div> 
# 
# attr(,"class")
# [1] "XMLNodeSet"

getNodeSet(doc2, xpath_query2)
# [[1]]
# <div>Fat"her's son</div> 
# 
# attr(,"class")
# [1] "XMLNodeSet"

edited Dec 17 '19 at 00:36

answered Dec 17 '19 at 00:30

Thank you, that's already a great help. Maybe I did not specify it well enough: I need to insert `target` dynamically, so something along the lines of `xpath_query2 <- quote(paste0('//body/*[contains(.,concat(', target,'))]'))` (this sample code obviously fails). Would something like that be possible? – Tlatwork Dec 17 '19 at 10:35
Yes, it is possible. Note the idea inside XPath query 2: single quotes go inside double quotes and double quotes go inside single quotes, and the pieces are then concatenated with the XPath concat function. You could create the XPath query dynamically by identifying the single and double quotes inside a string and handling them appropriately; you just write a generic function implementing this idea. Hope this helps. – Sathish Dec 17 '19 at 16:00
The problem with the XPath query, as far as I understand, is that it does not support escaping a double quote. You always get an error when you try to escape a double quote inside an XPath query. – Sathish Dec 17 '19 at 16:08

score 0 · Answer 4 · answered Jan 02 '20 at 13:46

I added the cat function to the target inside the html_nodes() function call. Seems to handle both the cases. cat() also has the side-effect of printing the escaped text.

library(rvest)
library(magrittr)

html <- "<div>1</div><div>Father's son</div>"
target <- "Father's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"",cat(target),"\")]"))
#> Father's son
#> {xml_nodeset (4)}
#> [1] <html><body>\n<div>1</div>\n<div>Father's son</div>\n</body></html>
#> [2] <body>\n<div>1</div>\n<div>Father's son</div>\n</body>
#> [3] <div>1</div>\n
#> [4] <div>Father's son</div>

html <- "<div>1</div><div>Father said \"Hello!\"</div>"
target <- 'Father said "Hello!"'
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"",cat(target),"\")]"))
#> Father said "Hello!"
#> {xml_nodeset (4)}
#> [1] <html><body>\n<div>1</div>\n<div>Father said "Hello!"</div>\n</body> ...
#> [2] <body>\n<div>1</div>\n<div>Father said "Hello!"</div>\n</body>
#> [3] <div>1</div>\n
#> [4] <div>Father said "Hello!"</div>

Thank you for your answer. It seems an output of 4 nodes is created, not one, so all nodes are selected. I think the XPath part within cat is just omitted; at least it looks like it if you save it to a variable. – Tlatwork, Jan 02 '20 at 14:27
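
The observation in this comment can be checked in base R without any HTML. The sketch below assumes nothing beyond base R. It captures what `cat()` prints and what it returns separately, then builds the query the same way the answer does.

```r
target <- "Father's son"

# cat() writes its argument to the console and returns NULL.
# capture.output() collects the printed text so the two can be inspected.
returned <- "placeholder"
printed  <- capture.output(returned <- cat(target))

# paste0() drops the zero-length NULL, so the target never reaches the query
xpath <- paste0("//*[contains(text(), \"", returned, "\")]")

print(printed)
print(is.null(returned))
print(xpath)
```

The query therefore ends up as `contains(text(), "")`, and since every string contains the empty string, every element matches, which is consistent with the four nodes in the output above.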
