How to find multiple substring matches within a string, alter substring enclosures

Question

I am trying to parse a string of HTML with Ruby. The string contains multiple <pre></pre> elements, and I need to find and encode all < and > brackets between each pair of tags.

Example: 

string_1_pre = "<pre><h1>Welcome</h1></pre>"

string_2_pre = "<pre><h1>Welcome</h1></pre><pre><h1>Goodbye</h1></pre>"

def clean_pre_code(html_string)
  matched = html_string.match(/(?<=<pre>).*(?=<\/pre>)/)
  cleaned = matched.to_s.gsub(/[<]/, "&lt;").gsub(/[>]/, "&gt;")
  html_string.gsub(/(?<=<pre>).*(?=<\/pre>)/, cleaned)
end

clean_pre_code(string_1_pre) #=> "<pre>&lt;h1&gt;Welcome&lt;/h1&gt;</pre>"
clean_pre_code(string_2_pre) #=> "<pre>&lt;h1&gt;Welcome&lt;/h1&gt;&lt;/pre&gt;&lt;pre&gt;&lt;h1&gt;Goodbye&lt;/h1&gt;</pre>"

This works as long as html_string contains only one <pre></pre> element, but not if there are multiple.

I would be open to a solution that utilizes Nokogiri or similar, but couldn't figure out how to make it do what I want.

Please let me know if you need any additional context.

Update: This is possible with Nokogiri; see the accepted answer.

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — CAustin, Feb 02 '19 at 00:56
Using match makes it want to match the entire string. So, it won't match substrings. I believe you'd have to use search for that. — , Feb 02 '19 at 01:36
Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Nick, Feb 02 '19 at 02:40

score 1 · Accepted Answer · answered Feb 02 '19 at 02:21

@zstrad44 Yes, you can do this with Nokogiri. Here is my version of the code, developed from yours; it gives the result you want when the string contains multiple pre tags.

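require "nokogiri"

def clean_pre_code(html_string)
  doc = Nokogiri::HTML(html_string)
  # Collect every <pre> element in the document.
  all_pre = doc.xpath('//pre')
  res = ""
  all_pre.each do |pre|
    # Serialize the element back to an HTML string.
    pre = pre.to_html
    # Encode the angle brackets between this element's own <pre> and </pre>.
    matched = pre.match(/(?<=<pre>).*(?=<\/pre>)/)
    cleaned = matched.to_s.gsub(/[<]/, "&lt;").gsub(/[>]/, "&gt;")
    res += pre.gsub(/(?<=<pre>).*(?=<\/pre>)/, cleaned)
  end
  # Note: the result holds only the concatenated <pre> elements.
  res
end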

I would recommend reading the Nokogiri Cheatsheet to get a better understanding of the methods I used in the code. Happy coding! Hope this helps.

tkhuynh

This was exactly what I was looking for, thanks! I figured Nokogiri was the best possible route but wasn't super familiar with it. Good work. – zstrad44 Feb 04 '19 at 15:12
One more issue I am seeing. It seems like this approach doesn't work if the `pre` tags contain `\r\n`. – zstrad44 Feb 04 '19 at 17:19
Actually, it was an issue with my regex, adding an `m` fixed the issue, so the regex is now `/(?<=<pre>).*(?=<\/pre>)/m`. – zstrad44 Feb 04 '19 at 17:34
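
For comparison, here is a minimal regex-only sketch of the same idea. It is not the accepted answer's method, and `escape_pre_contents` is a hypothetical name chosen for illustration. It assumes bare `<pre>` tags with no attributes and no nesting; for arbitrary HTML a real parser such as Nokogiri is still the safer route. The two changes from the question's regex are the non-greedy `.*?`, so each match stops at the nearest `</pre>`, and the `m` flag mentioned in the comment above, so `.` also matches `\r\n`.

```ruby
# Hypothetical helper (not from the thread): encodes < and > inside every
# <pre>...</pre> pair and leaves the rest of the string untouched.
# Assumption: <pre> tags carry no attributes and are not nested.
def escape_pre_contents(html_string)
  # .*? is non-greedy, so each match ends at the nearest </pre>;
  # the m flag lets . match newlines as well.
  html_string.gsub(/(?<=<pre>).*?(?=<\/pre>)/m) do |inner|
    # Replace each bracket via a lookup hash.
    inner.gsub(/[<>]/, "<" => "&lt;", ">" => "&gt;")
  end
end

string_2_pre = "<pre><h1>Welcome</h1></pre><pre><h1>Goodbye</h1></pre>"
puts escape_pre_contents(string_2_pre)
#=> <pre>&lt;h1&gt;Welcome&lt;/h1&gt;</pre><pre>&lt;h1&gt;Goodbye&lt;/h1&gt;</pre>
```

Passing a block to `gsub` means each match is encoded on its own, rather than reusing one precomputed replacement for every match as in the original code.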
