1

I have a question that is almost identical to "Ruby gsub multiple characters in string".

However, my string contains special characters:

a = "<p>text</p> <strong>bold</strong> and <em>italic</em>"

Using /\w+/ doesn't work for me. I tried many different combinations, but no luck. What RegEx match should I enter below to make it work? I want to replace those matches wherever they are in the string.

By the way I am using Rails.

My desired matches are:

a.gsub({{WHAT REGEX EXP?}},
  "\r\n" => "",
  "<p>" => "",
  "</p>" => "\n\n",
  "<br />" => "\n",
  "<strong>" => "*",
  "</strong>" => "*",
  "<em>" => "_",
  "</em>" => "_",
  "<s>" => "~",
  "</s>" => "~",
  "<blockquote>" => ">",
  "</blockquote>" => ">",
  "&" => "&amp;",
  "<" => "&lt;",
  ">" => "&gt;"
)
the Tin Man
Ben
  • It looks more like [this](https://stackoverflow.com/questions/28423345/gsub-for-multiple-patterns-and-multiple-replacements) one. – Sebastián Palma Jul 26 '19 at 20:57
  • @SebastianPalma, according to your link, you can't do multiple replacements with gsub, but you can. If you look at the link I provided, it does multiple replacements, but the regex handles characters only. I just need to handle any character. – Ben Jul 26 '19 at 21:11
  • If either answer was helpful please select the one that was most helpful to you. – Cary Swoveland Aug 01 '19 at 14:26

3 Answers

2

Calling #gsub! once per replacement works:

replacements = {
  "\r\n" => "",
  "<p>" => "",
  "</p>" => "\n\n",
  "<br />" => "\n",
  "<strong>" => "*",
  "</strong>" => "*",
  "<em>" => "_",
  "</em>" => "_",
  "<s>" => "~",
  "</s>" => "~",
  "<blockquote>" => ">",
  "</blockquote>" => ">",
  "&" => "&amp;",
  "<" => "&lt;",
  ">" => "&gt;"
}

a = "<p>text</p> <strong>bold</strong> and <em>italic</em>"

replacements.each do |find, replace|
  a.gsub!(find, replace)
end

a # => "text\n\n *bold* and _italic_"
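One thing to keep in mind with the loop above: each pass runs over the output of the previous passes, so a replacement value that contains a later key gets rewritten again. A small sketch of that effect, using a hypothetical helper `replace_sequentially` and a shortened table (both are illustrative and not part of the answer):

```ruby
# Shortened table for illustration. Order matters, because each gsub
# runs over the result of the previous one.
SEQUENTIAL_RULES = {
  "<blockquote>"  => ">",
  "</blockquote>" => ">",
  "&"             => "&amp;",
  "<"             => "&lt;",
  ">"             => "&gt;"
}.freeze

# Hypothetical helper: applies every rule in turn to a copy of the input.
def replace_sequentially(text, rules)
  rules.reduce(text.dup) { |acc, (find, replace)| acc.gsub(find, replace) }
end

# The ">" produced for the blockquote tags is escaped by the last rule.
puts replace_sequentially("<blockquote>quote</blockquote>", SEQUENTIAL_RULES)
# => &gt;quote&gt;
```

For the sample string in the question this does not matter, since it contains no blockquote tags.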
fphilipe
  • Thanks Philipe, that would work, but I am looking for a way to do this with just one gsub call. I believe I am just missing the right regex. – Ben Jul 26 '19 at 21:03
  • Why? This is simpler, most likely faster, and easier to maintain than a regex. – fphilipe Jul 26 '19 at 21:45
  • Very possible. Honestly, both answers are correct and yours is indeed easier to maintain. So I'll accept your "possible duplicate" as a solution and mention to whoever reads this in the future that both answers here are acceptable. Thanks again. – Ben Jul 26 '19 at 22:28
  • Agreed, it's probably a LOT faster. A string search is much faster than an unanchored pattern. – the Tin Man Jul 26 '19 at 22:37
2

It can be done in one go:

replacements = {
  "\r\n" => "",
  "<p>" => "",
  "</p>" => "\n\n",
  "<br />" => "\n",
  "<strong>" => "*",
  "</strong>" => "*",
  "<em>" => "_",
  "</em>" => "_",
  "<s>" => "~",
  "</s>" => "~",
  "<blockquote>" => ">",
  "</blockquote>" => ">",
  "&" => "&amp;",
  "<" => "&lt;",
  ">" => "&gt;"
}

keys = Regexp.union(replacements.keys)
a    = "<p>text</p> <strong>bold</strong> and <em>italic</em>"

p a.gsub(keys, replacements) # => "text\n\n *bold* and _italic_"

This works so easily because Regexp.union does all the hard work (escaping the weird chars) for you.
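To make that concrete, here is a small sketch of what Regexp.union builds and why key order matters. The names `MARKUP_RULES`, `MARKUP_PATTERN` and `convert_markup` are made up for this illustration, and the table is shortened.

```ruby
MARKUP_RULES = {
  "<br />"        => "\n",
  "<blockquote>"  => ">",
  "</blockquote>" => ">",
  "&"             => "&amp;",
  "<"             => "&lt;",
  ">"             => "&gt;"
}.freeze

# Regexp.union escapes each string and joins the results with "|".
# Alternatives are tried left to right, so the full tags have to come
# before the bare "<" and ">" keys, as they do in the hash above.
MARKUP_PATTERN = Regexp.union(MARKUP_RULES.keys)

# Hypothetical helper: a single gsub pass over the input.
def convert_markup(text)
  text.gsub(MARKUP_PATTERN, MARKUP_RULES)
end

# Regex metacharacters in the input strings are escaped.
puts Regexp.union(".", "a|b").source
# => \.|a\|b

p convert_markup("<blockquote>a & b</blockquote><br />")
# => ">a &amp; b>\n"
```

Because gsub scans the input only once, the `>` emitted for a blockquote tag is not escaped again by the `">" => "&gt;"` rule.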

steenslag
  • very nicely done. Will be using this. I wonder which method works faster... I would compare benchmarks but I have no idea how to do that :) – Ben Jul 26 '19 at 23:29
  • Be sure to read the answer to https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags . – steenslag Jul 26 '19 at 23:32
  • In regex long chains of alternatives, esp with a common prefix (in this case `<` or ` – mrzasa Jul 27 '19 at 10:37
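
On the benchmarking question raised in the comments: Ruby's standard library includes a Benchmark module. A minimal sketch follows, assuming an arbitrary repeat count of 10_000 and a shortened table; the constant and method names are hypothetical. No timings are claimed here, since they depend on the Ruby version and the input.

```ruby
require "benchmark"

BENCH_RULES = {
  "<p>" => "", "</p>" => "\n\n",
  "<strong>" => "*", "</strong>" => "*",
  "<em>" => "_", "</em>" => "_"
}.freeze
BENCH_PATTERN = Regexp.union(BENCH_RULES.keys)
BENCH_INPUT = "<p>text</p> <strong>bold</strong> and <em>italic</em>"

# One gsub per rule, each pass working on the previous result.
def with_loop(text)
  BENCH_RULES.reduce(text) { |acc, (find, replace)| acc.gsub(find, replace) }
end

# One gsub over a union of all keys.
def with_union(text)
  text.gsub(BENCH_PATTERN, BENCH_RULES)
end

n = 10_000 # arbitrary repeat count
Benchmark.bm(6) do |x|
  x.report("loop")  { n.times { with_loop(BENCH_INPUT) } }
  x.report("union") { n.times { with_union(BENCH_INPUT) } }
end
```

Both helpers return the same string for this input, so the comparison measures only the replacement strategy.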
1

You can do it with a single call; the regex is /<[^>]+>|[<>&]/

a = "<p>text</p> <strong>bold</strong> and <em>italic</em> & <>"
# `replacements` is the hash of find => replace pairs from the question
a.gsub(/(<[^>]+>|[<>&])/, replacements)
# => "text\n\n *bold* and _italic_ &amp; &lt;&gt;"


From the Ruby docs: String#gsub(pattern, hash) → new_str
If the second argument is a Hash, and the matched text is one of its keys, the corresponding value is the replacement string.

Regex explanation:

  • <[^>]+> matches HTML tags: first <, then one or more characters that are not > (that is [^>]+), and then the closing >
  • [<>&] matches a single occurrence of one of the special characters <, > or &
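
A quick way to check the two alternatives described above is to list what the pattern matches with String#scan. The sketch below also shows a side effect worth knowing: when gsub is given a hash, a match that is not one of its keys is replaced with an empty string, so tags missing from the table are dropped. The sample input, the shortened table and the `fetch` fallback are illustrative assumptions, not part of the answer.

```ruby
TAG_OR_SPECIAL = /<[^>]+>|[<>&]/

SMALL_TABLE = {
  "<em>" => "_", "</em>" => "_",
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;"
}.freeze

sample = "<em>a</em> & <u>b</u> <>"

# "<>" is not a tag (there is nothing between the brackets), so its two
# characters are matched one at a time by the second alternative.
p sample.scan(TAG_OR_SPECIAL)
# => ["<em>", "</em>", "&", "<u>", "</u>", "<", ">"]

# <u> and </u> are matched but are not keys, so they vanish.
p sample.gsub(TAG_OR_SPECIAL, SMALL_TABLE)
# => "_a_ &amp; b &lt;&gt;"

# The block form with Hash#fetch leaves unknown matches unchanged.
p sample.gsub(TAG_OR_SPECIAL) { |m| SMALL_TABLE.fetch(m, m) }
# => "_a_ &amp; <u>b</u> &lt;&gt;"
```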

That said, regex is not the best tool for processing HTML; it's better to use an HTML parser (e.g. Nokogiri).

mrzasa
  • Mrzasa, this works, thanks for your detailed explanation. I believe you have one mistake when you write "then > * [<>&]"; it should be "then > | [<>&]", right? How would you do the above with Nokogiri? Would it be cleaner, and would it work faster and require fewer resources? What would be the benefits? Thanks again. – Ben Jul 26 '19 at 21:28
  • I messed with list formatting, fixed. Thanks @Ben :) – mrzasa Jul 26 '19 at 21:29