I'm trying to replace instances of a unique string across a bunch of files by scanning the content of the nodes with Nokogiri and then performing a gsub. I'm keeping part of the string in place and transforming it into an anchor tag. However, most of the nodes contain various forms of markup and aren't just straightforward strings. For example, let's say I have a file like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<html>
<head>
<title>Title</title>
<link href="style.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div>
<p class="header"><<2>>Header</p>
<p class="paragraph">
<p class="text_style">Lorem ipsum blah blah blah. <<3>> Here is more content. <span class="style">Preserve this.</span> Blah blah extra text.</p>
</div>
</body>
</html>
There are numbers throughout the document, surrounded by << and >>. I want to take the value of each number and transform it into a tag like this: <a id='[#]'/>, but I want to preserve the HTML markup of other elements within the same section, e.g. <span class="style">Preserve this.</span>.
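To make the target concrete, here is a minimal sketch of the string-level transformation on its own, with no Nokogiri involved. The names `MARKER` and `anchorize` are hypothetical, and the `\d+` pattern assumes the markers only ever contain digits. The last line shows why that tighter pattern may matter: `[^_]*` is greedy and also matches `>` and `<`, so two markers in the same string get swallowed into a single match.

```ruby
# Hypothetical names; \d+ assumes the markers contain only digits.
MARKER = /<<(\d+)>>/

# Replace every <<N>> marker in a plain string with an empty anchor tag.
def anchorize(text)
  text.gsub(MARKER) { %(<a id="#{Regexp.last_match(1)}"/>) }
end

sample = "Lorem ipsum. <<3>> Here is more content. <<4>> Extra text."

# Each marker becomes its own anchor.
puts anchorize(sample)

# With the greedy [^_]* pattern, everything from the first << to the
# last >> is captured as one "number".
puts sample.gsub(/<<([^_]*)>>/, '<a id="\1"/>')
```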
Here's everything I've tried:
file = File.open("file.xhtml") { |f| Nokogiri::XML(f) }
file.xpath("//text()").each { |node|
  if node.text.match(/<<([^_]*)>>/)
    new_content = node.text.gsub(/<<([^_]*)>>/, "<a id=\"\\1\"/>")
    node.parent.inner_html = new_content
  end
}
The gsub works correctly, but because it uses the .text method, any markup is ignored and effectively wiped out. In this case, the <span class="style">Preserve this.</span> part is completely removed. (FYI, I use the .parent method because if I just do node.inner_html = new_content, I get this error: add_child_node': cannot reparent Nokogiri::XML::Element there (ArgumentError).)
If I do this instead:
new_content = node.text.gsub(/<<([^_]*)>>/,"<a id=\"\\1\"/>")
node.content = new_content
the angle brackets get escaped as entities: the file ends up with &lt;a id="3"/&gt; instead of <a id="3"/>.
I tried using the CSS methods instead like so:
file.css("*").each { |node|
  if node.inner_html.match(/<<([^_]*)>>/)
    new_content = node.inner_html.gsub(/<<([^_]*)>>/, "<a id=\"\\1\"/>")
    node.inner_html = new_content
  end
}
The gsub works, the markup is preserved, and the replaced tags are escaped properly. But the <head> and <body> tags are removed, which results in an invalid file:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<html>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Title</title>
<link href="style.css" rel="stylesheet" type="text/css"/>
<div>
<p class="header"><a id="2"/>Header</p>
<p class="paragraph">
</p><p class="text_style">Lorem ipsum blah blah blah. <a id="3"/> Here is more content. <span class="style">Preserve this.</span> Blah blah extra text. </p>
</div>
</html>
I suspect it has something to do with the fact that I'm iterating over all the nodes (file.css("*")), which is also redundant, since a parent node is scanned in addition to its children.
I've scoured the web but can't find any solutions for this. I just want to be able to swap out unique text while maintaining markup and having it be correctly encoded. Is there something very obvious that I'm missing here?