
I'm trying to replace instances of a unique string across a bunch of files by scanning the content of the nodes with Nokogiri and then performing a gsub. I'm keeping part of the string in place, and transforming it into an anchor tag. However, the majority of the nodes have various forms of markup in the contents, and aren't just straightforward strings. For example, let's say I have a file like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<html>
    <head>
        <title>Title</title>
        <link href="style.css" rel="stylesheet" type="text/css" />
    </head>
    <body>
        <div>
            <p class="header">&lt;&lt;2&gt;&gt;Header</p>
            <p class="paragraph">
            <p class="text_style">Lorem ipsum blah blah blah. &lt;&lt;3&gt;&gt; Here is more content. <span class="style">Preserve this.</span> Blah blah extra text.</p>
        </div>
    </body>
</html>

There are numbers throughout the document, surrounded by &lt;&lt; and &gt;&gt;. I want to take the value of the number and transform it into a tag like this: <a id='[#]'/>, but I want to preserve the HTML markup of other elements within the same section, i.e. <span class="style">Preserve this.</span>.

Here's everything I've tried:

file = File.open("file.xhtml") {|f| Nokogiri::XML(f)}

file.xpath("//text()").each { |node|
    if node.text.match(/<<([^_]*)>>/)
        new_content = node.text.gsub(/<<([^_]*)>>/,"<a id=\"\\1\"/>")
        node.parent.inner_html = new_content
    end
}

The gsub works correctly, but because it uses the .text method, any markup is ignored and effectively wiped out. In this case, the <span class="style">Preserve this.</span> part is completely removed. (FYI, I use the .parent method because if I just do node.inner_html = new_content I get this error: add_child_node': cannot reparent Nokogiri::XML::Element there (ArgumentError).)

If I do this instead:

    new_content = node.text.gsub(/<<([^_]*)>>/,"<a id=\"\\1\"/>")
    node.content = new_content

the markup is escaped as text: the file ends up with &lt;a id="3"/&gt; instead of <a id="3"/>.

I tried using the CSS methods instead like so:

file.css("*").each { |node|
    if node.inner_html.match(/&lt;&lt;([^_]*)&gt;&gt;/)
        new_content = node.inner_html.gsub(/&lt;&lt;([^_]*)&gt;&gt;/,"<a id=\"\\1\"/>")
        node.inner_html = new_content
    end
}

The gsub works, the markup is preserved, and the inserted tags come out as real markup rather than escaped text. But the <head> and <body> tags are removed, which results in an invalid file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<html>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        <title>Title</title>
        <link href="style.css" rel="stylesheet" type="text/css"/>
        <div>
            <p class="header"><a id="2"/>Header</p>
            <p class="paragraph">
            </p><p class="text_style">Lorem ipsum blah blah blah. <a id="3"/> Here is more content. <span class="style">Preserve this.</span> Blah blah extra text. </p>    
    </div>
</html>

I suspect it has something to do with the fact that I'm iterating over all the nodes (file.css("*")), which is also redundant, since a parent node is scanned in addition to its children.

I've scoured the web but can't find any solutions for this. I just want to be able to swap out unique text while maintaining markup and having it be correctly encoded. Is there something very obvious that I'm missing here?

lumos
  • This is a horrible way to find nodes to modify. Instead use selectors to find the exact nodes you want to change then change their `text`. If you are responsible for the markup then put some useful class information in the nodes that makes it easy to pinpoint them. Basically you're creating a templating engine. – the Tin Man Jan 29 '20 at 23:16
  • @theTinMan Yes, it is not the way I want to find nodes, which is why I asked the question in the first place. The issue is that the unique text I'm searching for can appear anywhere, in any node, with any selector—I don't know what the "exact nodes" are, which is the point. I provided a simple HTML doc for reference, but the actual content is exported through InDesign, and is relatively complex, long, and unpredictable. – lumos Jan 30 '20 at 04:40
  • Are the classes always defined? And, why is it defined as XML but it's HTML? The problem with searching for text is, `text` or `content` or any selector looking for text will find it in children of that node, even if it's nested deeply. You might want to ask this on the Nokogiri support groups. https://groups.google.com/forum/#!forum/nokogiri-talk – the Tin Man Feb 02 '20 at 23:25

1 Answer


It looks like this works pretty well:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<html>
    <head>
        <title>Title</title>
        <link href="style.css" rel="stylesheet" type="text/css" />
    </head>
    <body>
        <div>
            <p class="header">&lt;&lt;2&gt;&gt;Header</p>
            <p class="paragraph">
            <p class="text_style">Lorem ipsum. &lt;&lt;3&gt;&gt; more content. <span class="style">Preserve this.</span> extra text.</p>
        </div>
    </body>
</html>
EOT

doc.search("//text()[contains(.,'<<')]").each do |node|
  node.replace(node.content.gsub(/<<(\d+)>>/, '<a id="[\1]" />'))
end

Which results in:

puts doc.to_html

# >> <html>
# >>     <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
# >>         <title>Title</title>
# >>         <link href="style.css" rel="stylesheet" type="text/css">
# >>     </head>
# >>     <body>
# >>         <div>
# >>             <p class="header"><a id="[2]"></a>Header</p>
# >>             <p class="paragraph">
# >>             <p class="text_style">Lorem ipsum. <a id="[3]"></a> more content. <span class="style">Preserve this.</span> extra text.</p>
# >>         </p>
# >>     </div>
# >> </body>
# >> </html>

Nokogiri is adding the

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

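line, probably because the document is parsed as XML but serialized with to_html, which uses the HTML serializer. Serializing with doc.to_xml instead should leave it out.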

The selector "//text()[contains(.,'<<')]" only looks for text nodes containing '<<'. If that can produce false positives, make it more specific. See "XPath: using regex in contains function" for the syntax.

replace does the trick. You were trying to modify a Nokogiri::XML::Text node to contain an <a.../>, but a text node can't hold markup, so the < and > get encoded. When replace is given a string, Nokogiri parses it as markup, so <a id="[2]" /> becomes a Nokogiri::XML::Element sitting between the surrounding text nodes, which lets it be stored the way you want.

the Tin Man