Modifying text inside html nodes - nokogiri

Question

Let's say I have the following HTML:

<ul><li>Bullet 1.</li>
<li>Bullet 2.</li>
<li>Bullet 3.</li>
<li>Bullet 4.</li>
<li>Bullet 5.</li></ul>

What I want to do is append an asterisk to every period, question mark, or exclamation mark that sits inside an HTML node, then convert the result back to HTML. So the result would be:

<ul><li>Bullet 1.*</li>
<li>Bullet 2.*</li>
<li>Bullet 3.*</li>
<li>Bullet 4.*</li>
<li>Bullet 5.*</li></ul>

I've been messing around with this a bit in IRB, but can't quite figure it out. Here's the code I have:

html = "<ul><li>Bullet 1.</li>
<li>Bullet 2.</li>
<li>Bullet 3.</li>
<li>Bullet 4.</li>
<li>Bullet 5.</li></ul>"

doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.search("*").map { |n| n.inner_text.gsub(/(?<=[.!?])(?!\*)/, "*") }

The array that comes back is parsed out correctly, but I'm not sure how to convert it back into HTML. Is there another method I can use to modify the inner_text like this?

score 12 · Accepted Answer · answered Aug 29 '11 at 19:09


What about this code?

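doc.traverse do |x|
  if x.text?
    # The pattern is zero-width (lookbehind + negative lookahead), so there is
    # no capture group; a plain "*" is all that needs to be inserted.
    x.content = x.content.gsub(/(?<=[.!?])(?!\*)/, "*")
  end
end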

The traverse method walks the tree much like search("*").each, except that it also yields text nodes. You then check whether the node is a Nokogiri::XML::Text and, if so, change the content as you wish.

answered Aug 29 '11 at 19:09

Serabe


I do like your code better, as it's a lot cleaner and easier to read. The only problem is that if the UL has a text node after it, the period will get replaced there as well. I only want this to happen if it's inside an HTML node of some sort. (I'm not parsing full HTML docs here.) It will probably be only bullet lists and anchor tags that I'll ever run into with this project. I should have clarified my requirements there, because otherwise your answer is perfect. – agmcleod Aug 29 '11 at 19:19
After inspecting the nodes and doing some testing, here's a solution I found that works: http://pastie.org/2450340 Thanks for helping me get here :). – agmcleod Aug 29 '11 at 19:27
Then parse it as a normal HTML document (`Nokogiri::HTML(html)`) and search like this: `doc.search("//li/text()").each do |x|`. Of course, the `if x.text?` check is no longer needed. – Serabe Aug 29 '11 at 19:39

score 1 · Answer 2 · edited May 23 '17 at 11:53

Thanks to the post "Nokogiri replace tag values", I was able to modify it a bit and figure it out.

doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.search("*").each do |node|
  dummy = node.add_previous_sibling(Nokogiri::XML::Node.new("dummy", doc))
  dummy.add_previous_sibling(Nokogiri::XML::Text.new(node.to_s.gsub(/(?<=[.!?])(?!\*)/, "*"), doc))
  node.remove
  dummy.remove
end

puts doc.to_html.gsub("&lt;", "<").gsub("&gt;", ">")
