
I am trying to write a method that removes blacklisted characters, such as BOM characters, using their UTF-8 values. I got this working by adding a method to the String class with the following logic:

  def remove_blacklist_utf_chars
    self.force_encoding("UTF-8").gsub!(config[:blacklist_utf_chars][:zero_width_space].force_encoding("UTF-8"), "")
    self
  end

To make it reusable across applications, I moved the blacklist into a config in a yml file. The yml structure is something like this:

:blacklist_utf_chars:
  :zero_width_space: '"\u{200b}"'

(Edit) As suggested by Drenmi, I also tried the following, but it didn't work:

:blacklist_utf_chars:
  :zero_width_space: \u{200b}

The problem is that remove_blacklist_utf_chars does not work when the blacklisted characters are loaded from the yml file, but it does work when I pass them directly in the method instead of via the yml file.

So basically
self.force_encoding("UTF-8").gsub!("\u{200b}".force_encoding("UTF-8"), "") -- works.

but,

self.force_encoding("UTF-8").gsub!(config[:blacklist_utf_chars][:zero_width_space].force_encoding("UTF-8"), "") -- doesn't work.

I printed the value of config[:blacklist_utf_chars][:zero_width_space] and it's equal to "\u{200b}".

I got this idea by referring: https://stackoverflow.com/a/5011768/2362505.

Now I am not sure what exactly is happening when the blacklist chars list is loaded via yml into the Ruby code.

EDIT 2:

On further investigation I observed that there is an extra \ getting added while reading the hash from the yaml. So,

puts config[:blacklist_utf_chars][:zero_width_space].dump

prints:

"\\u{200b}"

But then if I just define the yaml as:

:blacklist_utf_chars:
  :zero_width_space: 200b

and do,

ch = "\u{#{config[:blacklist_utf_chars][:zero_width_space]}}"
self.force_encoding("UTF-8").gsub!(ch.force_encoding("UTF-8"), "")

I get

/Users/harshsingh/dir/to/code/utils.rb:121: invalid Unicode escape (SyntaxError)
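
The SyntaxError arises because Ruby resolves `\u{...}` escapes when it parses the source file, before any `#{...}` interpolation runs, so the braces never contain a valid hex number at that point. One possible workaround, sketched here with a hypothetical `hex` variable standing in for the value read from the config, is to convert the hex text to an integer and build the character at runtime with `Integer#chr`:

```ruby
# Hypothetical stand-in for config[:blacklist_utf_chars][:zero_width_space]
hex = "200b"

# Parse the hex text as a code point, then turn it into a
# one-character UTF-8 string at runtime.
# to_s guards against YAML loading an all-digit value as an Integer.
zero_width_space = hex.to_s.to_i(16).chr(Encoding::UTF_8)

text = "foo\u200bbar"
cleaned = text.gsub(zero_width_space, "")

puts cleaned
```

Quoting the value in the yml file (for example `'200b'`) would keep values that happen to look like numbers from being loaded as something other than a string.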
harshs08

2 Answers


The "\u{200b}" syntax is used for escaping Unicode characters in Ruby source code. It won’t work inside Yaml.

The equivalent syntax for a Yaml document is the similar "\u200b" (which also happens to be valid in Ruby). Note the lack of braces ({}). The double quotes are also required; otherwise the value is parsed as the literal text \u200b.

So your Yaml file should look like this:

:blacklist_utf_chars:
  :zero_width_space: "\u200b"
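
A quick way to check how each spelling is loaded is to parse it and inspect the result. The following is a minimal sketch, assuming the standard `yaml` library (Psych) and using a hypothetical plain string key `value` instead of the symbol keys from the question, so that it does not depend on how a given Psych version treats symbols:

```ruby
require "yaml"

# Double-quoted YAML scalar: the \u200b escape is decoded by the YAML parser.
double_quoted = YAML.load(%q(value: "\u200b"))["value"]

# Single-quoted YAML scalar from the question: no escapes are processed,
# so the inner double quotes, backslash and braces are kept literally.
single_quoted = YAML.load(%q(value: '"\u{200b}"'))["value"]

# Unquoted (plain) scalar: also kept as literal text.
plain_scalar = YAML.load(%q(value: \u{200b}))["value"]

puts double_quoted.codepoints.inspect  # => [8203], i.e. U+200B
puts single_quoted.length              # => 10
puts plain_scalar.length               # => 8

text = "foo\u200bbar"
cleaned   = text.gsub(double_quoted, "")  # removes the zero-width space
untouched = text.gsub(single_quoted, "")  # pattern is never found

puts cleaned
```

Only the double-quoted form yields a single character, which is why it is the only one that matches in `gsub`.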
matt

If you puts the value, and get the output "\u{200b}", it means the quotes are included in your string. I.e., you're actually calling:

self.force_encoding("UTF-8").gsub!('"\u{200b}"'.force_encoding("UTF-8"), "")

Try changing your YAML file to:

:blacklist_utf_chars:
  :zero_width_space: \u{200b}
Drenmi