How to replace multiple characters in string with special characters

Question

I have a question that is almost identical to "Ruby gsub multiple characters in string".

However, my string contains special characters:

a = "<p>text</p> <strong>bold</strong> and <em>italic</em>"

Using /\w+/ doesn't work for me. I tried many different combinations, but no luck. What RegEx match should I enter below to make it work? I want to replace those matches wherever they are in the string.

By the way I am using Rails.

My desired matches are:

a.gsub({{WHAT REGEX EXP?}},
  "\r\n" => "",
  "<p>" => "",
  "</p>" => "\n\n",
  "<br />" => "\n",
  "<strong>" => "*",
  "</strong>" => "*",
  "<em>" => "_",
  "</em>" => "_",
  "<s>" => "~",
  "</s>" => "~",
  "<blockquote>" => ">",
  "</blockquote>" => ">",
  "&" => "&amp;",
  "<" => "&lt;",
  ">" => "&gt;"
)

It looks more like [this](https://stackoverflow.com/questions/28423345/gsub-for-multiple-patterns-and-multiple-replacements) one. — Sebastián Palma, Jul 26 '19 at 20:57
@SebastianPalma, according to your link, you can't do multiple replacements with gsub, but you can. If you look at the link I provided, it does multiple replacements, but the regex handles characters only. I just need to handle any character. — Ben, Jul 26 '19 at 21:11
If either answer was helpful please select the one that was most helpful to you. — Cary Swoveland, Aug 01 '19 at 14:26

score 2 · Answer 1 · answered Jul 26 '19 at 20:56

Calling String#gsub! once per pair works:

replacements = {
  "\r\n" => "",
  "<p>" => "",
  "</p>" => "\n\n",
  "<br />" => "\n",
  "<strong>" => "*",
  "</strong>" => "*",
  "<em>" => "_",
  "</em>" => "_",
  "<s>" => "~",
  "</s>" => "~",
  "<blockquote>" => ">",
  "</blockquote>" => ">",
  "&" => "&amp;",
  "<" => "&lt;",
  ">" => "&gt;"
}

a = "<p>text</p> <strong>bold</strong> and <em>italic</em>"

replacements.each do |find, replace|
  a.gsub!(find, replace)
end

a # => "text\n\n *bold* and _italic_"
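One caveat with the loop is worth sketching. Each gsub! call sees the output of the previous one, so text produced by an earlier rule can be rewritten by a later rule. The sample input above is unaffected because it contains no blockquote, but with the full table a "<blockquote>" would become ">" and then "&gt;". The shortened table and the input below are hypothetical, chosen only to make the difference visible:

```ruby
# Hypothetical, shortened table: just enough rules to show the interaction.
replacements = {
  "<blockquote>" => ">",
  "&" => "&amp;",
  ">" => "&gt;"
}

text = "<blockquote>a & b"

# Sequential: every pass operates on the result of the previous pass.
sequential = text.dup
replacements.each { |find, replace| sequential.gsub!(find, replace) }

# Single pass: each position of the original string is replaced at most once.
single_pass = text.gsub(Regexp.union(replacements.keys), replacements)

p sequential   # => "&gt;a &amp; b"
p single_pass  # => ">a &amp; b"
```

Whether the sequential or the single-pass behavior is the desired one depends on the table; the point is only that the two are not always interchangeable.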

answered Jul 26 '19 at 20:56

fphilipe


Thanks Philipe, that would work, but I am looking for a way to do this with just one gsub call. I believe I am just missing the right regex. – Ben Jul 26 '19 at 21:03
Why? This is simpler, most likely faster, and easier to maintain than a regex. – fphilipe Jul 26 '19 at 21:45
Very possible. Honestly, both answers are correct and yours is indeed easier to maintain. So I'll accept your "possible duplicate" as a solution and mention to whoever reads this in the future that both answers here are acceptable. Thanks again. – Ben Jul 26 '19 at 22:28
Agreed, it's probably a LOT faster. A string search is much faster than an unanchored pattern. – the Tin Man Jul 26 '19 at 22:37

steenslag · Answer 2 · 2019-07-26T23:21:50.553

It can be done in one go:

replacements = {
  "\r\n" => "",
  "<p>" => "",
  "</p>" => "\n\n",
  "<br />" => "\n",
  "<strong>" => "*",
  "</strong>" => "*",
  "<em>" => "_",
  "</em>" => "_",
  "<s>" => "~",
  "</s>" => "~",
  "<blockquote>" => ">",
  "</blockquote>" => ">",
  "&" => "&amp;",
  "<" => "&lt;",
  ">" => "&gt;"
}

keys = Regexp.union(replacements.keys)
a    = "<p>text</p> <strong>bold</strong> and <em>italic</em>"

p a.gsub(keys, replacements) # => "text\n\n *bold* and _italic_"

This works so easily because Regexp.union does all the hard work (escaping the weird chars) for you.
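A small sketch of what that means in practice, using made-up keys rather than the ones from the question: Regexp.union escapes each string so metacharacters match literally, and it keeps the alternatives in the order given. That order matters when one key is a prefix of another, as "<" is of "<p>".

```ruby
# Regexp.union escapes each string, so metacharacters match literally.
pattern = Regexp.union(["1+1", "a.b"])
p pattern.match?("1+1")  # => true
p pattern.match?("11")   # => false ("+" is not a quantifier here)
p pattern.match?("axb")  # => false ("." is not a wildcard here)

# Alternatives are tried left to right, so key order matters when
# one key is a prefix of another.
table = { "<p>" => "", "<" => "&lt;" }
tags_first = Regexp.union(table.keys)          # "<p>" is tried before "<"
lt_first   = Regexp.union(table.keys.reverse)  # "<" is tried before "<p>"

p "<p>x".gsub(tags_first, table)  # => "x"
p "<p>x".gsub(lt_first, table)    # => "&lt;p>x"
```

In the replacements hash above, the bare "<", ">" and "&" keys come last, which is why the tags are matched as whole units.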

edited Jul 26 '19 at 23:21

answered Jul 26 '19 at 23:16

steenslag


very nicely done. Will be using this. I wonder which method works faster... I would compare benchmarks but I have no idea how to do that :) – Ben Jul 26 '19 at 23:29
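For anyone wanting to compare the two approaches, Ruby's standard library ships a Benchmark module. The following is only a minimal sketch: the shortened table, the iteration count n, and the names loop_version and union_version are assumptions made for illustration, and the timings will vary with input size and Ruby version, so no results are claimed here.

```ruby
require "benchmark"

# Shortened, illustrative table and input (not the full list from the question).
replacements = {
  "<p>" => "", "</p>" => "\n\n",
  "<strong>" => "*", "</strong>" => "*",
  "<em>" => "_", "</em>" => "_"
}
input = "<p>text</p> <strong>bold</strong> and <em>italic</em>"
keys  = Regexp.union(replacements.keys)
n     = 10_000 # hypothetical iteration count

# One gsub! per pair, on a copy so the input is left untouched.
loop_version = lambda do |str|
  out = str.dup
  replacements.each { |find, replace| out.gsub!(find, replace) }
  out
end

# A single gsub with a precompiled union pattern.
union_version = ->(str) { str.gsub(keys, replacements) }

Benchmark.bm(12) do |x|
  x.report("each + gsub!") { n.times { loop_version.call(input) } }
  x.report("Regexp.union") { n.times { union_version.call(input) } }
end
```

Measuring with realistic input is advisable, since a one-line string says little about behavior on long documents.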
Be sure to read the answer to https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags . – steenslag Jul 26 '19 at 23:32
In regex long chains of alternatives, esp with a common prefix (in this case `<` or ` – mrzasa Jul 27 '19 at 10:37

mrzasa · Answer 3 · 2019-07-26T21:29:16.357

You can do it with a single call. The regex is /<[^>]+>|[<>&]/, and replacements is the hash of pairs from the question:

a = "<p>text</p> <strong>bold</strong> and <em>italic</em> & <>"
a.gsub(/(<[^>]+>|[<>&])/, replacements)
# => "text\n\n *bold* and _italic_ &amp; &lt;&gt;"

From the Ruby docs for String#gsub(pattern, hash) → new_str:
If the second argument is a Hash, and the matched text is one of its keys, the corresponding value is the replacement string.
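A short sketch of the hash form, using a made-up table. It also shows what happens when the pattern matches text that is not a key: the lookup returns nil, so the match is replaced with an empty string. With the tag regex above, that means any tag missing from the hash is silently dropped. Giving the hash a default block is one way to keep such matches unchanged.

```ruby
table = { "e" => "3", "o" => "0" }

# Every match is looked up in the hash.
p "hello world".gsub(/[eo]/, table)  # => "h3ll0 w0rld"

# "l" matches the pattern but is not a key, so it is replaced with "".
p "hello".gsub(/[elo]/, table)       # => "h30"

# A default block returns the matched text itself for unknown keys.
keep_unknown = Hash.new { |_hash, key| key }
keep_unknown.update(table)
p "hello".gsub(/[elo]/, keep_unknown) # => "h3ll0"
```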

Regex explanation:

<[^>]+> matches HTML tags: first a literal <, then one or more characters that are not > (the [^>]+ part), and finally the closing >.
[<>&] matches a single special character (<, > or &) that is not part of a tag.
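To see which pieces each alternative picks up, String#scan can list the matches. The inputs below are made up for illustration. The second one shows a limitation: a "<" followed later by a ">" is treated as one tag, even in plain text.

```ruby
pattern = /<[^>]+>|[<>&]/

# Tags are matched whole; "<>" has nothing between the brackets,
# so it falls through to the single-character alternative.
p "<p>a & b</p> <>".scan(pattern)  # => ["<p>", "&", "</p>", "<", ">"]

# Caveat: everything between "<" and the next ">" counts as a tag.
p "1 < 2 and 3 > 2".scan(pattern)  # => ["< 2 and 3 >"]
```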

That said, regex is not the best tool for processing HTML; it's better to use an HTML parser (e.g. Nokogiri).

edited Jul 26 '19 at 21:29

answered Jul 26 '19 at 21:10

mrzasa


Mrzasa, this works, thanks for your detailed explanation. I believe you have one mistake: where you write "then > * [<>&]" it should be "then > | [<>&]", right? How would you do the above with Nokogiri? Would it be cleaner, and would it work faster and require fewer resources? What would be the benefits? Thanks again. – Ben Jul 26 '19 at 21:28
I messed with list formatting, fixed. Thanks @Ben :) – mrzasa Jul 26 '19 at 21:29
