How can I remove non-printable, invisible characters from a string?

Ruby version: 2.4.1

2.4.1 :209 > product.name.gsub(/[^[:print:]]/,'.')
 => "Kanha‬" 
2.4.1 :210 > product.name.gsub(/[^[:print:]]/,'.').length
 => 6 

2.4.1 :212 > product.name.gsub(/[\u0080-\u00ff]/, '').length
 => 6 

2.4.1 :214 > product.name.chars.reject { |char| char.ascii_only? and (char.ord < 32 or char.ord == 127) }.join.length
 => 6 

2.4.1 :216 > product.name.gsub(/[^[:print:]]/i, '').length
 => 6 

The word "Kanha" has 5 letters. However, there is a 6th character that is not printable. How can I remove it?

After googling and searching Stack Overflow I have already tried a few approaches but, as you can see, none of them help.

It is causing problems when we try to integrate our data with other systems.

Surya
  • Possible duplicate of [How to remove non-printable/invisible characters in ruby?](https://stackoverflow.com/questions/16530038/how-to-remove-non-printable-invisible-characters-in-ruby) – dug Apr 25 '19 at 14:27
  • Nope, I have tried those and none of them work, and the links given in the answer there no longer work :( – Surya Apr 25 '19 at 15:29
  • Your unwanted character (`U+202C`) is considered printable (see `product.name.each_char.all?(/[[:print:]]/)`). Do you have issues with other characters? Deleting one character should be easy. – cremno Apr 25 '19 at 16:11
  • `product.name.each_char.all?(/[[:print:]]/)` gives me an error: `ArgumentError: wrong number of arguments (given 1, expected 0)`. I don't get what you are trying to say. – Surya Apr 25 '19 at 16:57
  • It would be very helpful if you would replace `product.name` with its value (do that with `puts product.name`, cut and paste and add quotes). That way we could figure out what the offending character is, which may lead to a solution. Except for that it's an interesting question. btw, you don't need "Ruby" in the title as it is a tag. – Cary Swoveland Apr 25 '19 at 18:36
  • @cremno How did you find that the unwanted character was `U+202C`? Also, `product.name.each_char.all?(/[[:print:]]/)` gives me an error: `ArgumentError: wrong number of arguments (given 1, expected 0)`. I don't get what you are trying to say. Please do explain. – Surya Apr 30 '19 at 10:57
  • Sorry I forgot this feature was introduced in 2.5 and not 2.4. Also the accepted answer already mentions how to find such 'invisible' characters. – cremno May 01 '19 at 06:32

1 Answer

First, let's figure out what the offending character is:

str = "Kanha‬"
p str.codepoints
# => [75, 97, 110, 104, 97, 8236]

The first five codepoints are between 0 and 127, meaning they're ASCII characters. It's safe to assume they're the letters K-a-n-h-a, although this is easy to verify if you want:

p [75, 97, 110, 104, 97].map(&:chr)
# => ["K", "a", "n", "h", "a"]

That means the offending character is the last one, codepoint 8236. That's a decimal (base 10) number, though, and Unicode characters are usually listed by their hexadecimal (base 16) number. 8236 in hexadecimal is 202C (8236.to_s(16) # => "202c"), so we just have to google for U+202C.
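
The decimal-to-hexadecimal step above can be sketched as a small helper that prints every codepoint in the U+XXXX form used by Unicode charts. `describe_codepoints` is a hypothetical name chosen for illustration, not something from Ruby's standard library, and the string is written with a `\u202C` escape so the invisible character shows up in the source:

```ruby
# Hypothetical helper (not part of Ruby's standard library):
# list each character's codepoint in U+XXXX notation.
def describe_codepoints(str)
  str.each_char.map { |ch| format("U+%04X", ch.ord) }
end

# The invisible character is spelled as an escape so it is visible here.
str = "Kanha\u202C"
p describe_codepoints(str)
# => ["U+004B", "U+0061", "U+006E", "U+0068", "U+0061", "U+202C"]
```

Any entry outside the range you expect (here, everything but the last one is plain ASCII) is a candidate to look up.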

Google very quickly tells us that the offending character is U+202C POP DIRECTIONAL FORMATTING and that it's a member of the "Other, Format" category of Unicode characters. Wikipedia says of this category:

Includes the soft hyphen, joining control characters (zwnj and zwj), control characters to support bi-directional text, and language tag characters

It also tells us that the "value" or code for the category is "Cf". If these sound like characters you want to remove from your string along with U+202C, you can use the \p{Cf} property in a Ruby regular expression. You can also use \P{Print} (note the capital P) as an equivalent to [^[:print:]]:

str = "Kanha‬"
p str.length # => 6

p str.gsub(/\P{Print}|\p{Cf}/, '') # => "Kanha"
p str.gsub(/\P{Print}|\p{Cf}/, '').length # => 5

See it on repl.it: https://repl.it/@jrunning/DutifulRashTag
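
If you need this in more than one place, the same regular expression can be wrapped in a method. This is only a sketch: `strip_invisible` is a hypothetical name, and the first check simply reproduces what the question observed, namely that `[^[:print:]]` alone does not match anything in this string.

```ruby
# Hypothetical helper (name is an assumption): remove characters that
# [[:print:]] does not match, plus Unicode "Other, Format" (Cf) characters.
def strip_invisible(str)
  str.gsub(/\P{Print}|\p{Cf}/, '')
end

name = "Kanha\u202C" # U+202C written as an escape so it is visible

# U+202C counts as printable, so the negated class finds nothing to replace.
p name.match?(/[^[:print:]]/)   # => false

p strip_invisible(name)         # => "Kanha"
p strip_invisible(name).length  # => 5
```

Keep in mind that \p{Cf} also covers the soft hyphen and the zero-width joiners mentioned in the Wikipedia description, so stripping the whole category may be too broad for text where those characters are meaningful. In that case, deleting only the one character (`str.delete("\u202C")`) is the narrower option.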

Jordan Running
  • Thank you so much. That works. How did you find this was the problem? – Surya Apr 26 '19 at 04:35
  • cremno in the comments above pointed out that the offending character was U+202C, so it was easy to google for, but I've edited my answer to add some details on how to figure that much out on your own. And I know from reading the Ruby Regexp docs that `\p{...}` will match any Unicode category (and `\P{...}` is its inverse), so it was just a matter of figuring out which category. – Jordan Running Apr 26 '19 at 16:22
  • @JordanRunning This was so very helpful. Perfect example of how to respond to a question. – SamuelLJohnson Jan 26 '23 at 00:44