How to get rid of non-ascii characters in ruby

Question

I have a Ruby CGI (not Rails) that picks up photos and captions from a web form. My users are very keen on smart quotes and ligatures, which they paste in from other sources. My web app does not deal well with these non-ASCII characters. Is there a quick Ruby string manipulation routine that can get rid of them?

Nathan Long · Answer 1 · 2016-08-19T15:39:07.950

Use String#encode

The official way to convert between string encodings as of Ruby 1.9 is to use String#encode.

To simply remove non-ASCII characters, you could do this:

some_ascii   = "abc"
some_unicode = "áëëçüñżλφθΩ"
more_ascii   = "123ABC"
invalid_byte = "\255"

non_ascii_string = [some_ascii, some_unicode, more_ascii, invalid_byte].join

# See String#encode documentation
encoding_options = {
  :invalid           => :replace,  # Replace invalid byte sequences
  :undef             => :replace,  # Replace anything not defined in ASCII
  :replace           => '',        # Use a blank for those replacements
  :universal_newline => true       # Always break lines with \n
}

# Pass the options as keywords; Ruby 2.7+ warns about a bare trailing hash here
ascii = non_ascii_string.encode(Encoding.find('ASCII'), **encoding_options)
puts ascii.inspect
  # => "abce123ABC"

Notice that the first 5 characters in the result are "abce1" - the "á" was discarded, one "ë" was discarded, but another "ë" appears to have been converted to "e".

The reason for this is that there are sometimes multiple ways to express the same written character in Unicode. The "á" is a single Unicode codepoint. The first "ë" is, too. When Ruby sees these during this conversion, it discards them.

But the second "ë" is two codepoints: a plain "e", just like you'd find in an ASCII string, followed by a "combining diacritical mark" (U+0308, combining diaeresis), which means "put an umlaut on the previous character". In the Unicode string, these are interpreted as a single "grapheme", or visible character. When converting this, Ruby keeps the plain ASCII "e" and discards the combining mark.
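To see the difference between the two forms, here is a small sketch. It assumes Ruby 2.2+ (for String#unicode_normalize), and the helper name ascii_fold is made up for illustration. Decomposing with NFD first splits precomposed letters into a base letter plus combining marks, so the ASCII base letter survives the conversion instead of being discarded:

```ruby
composed   = "\u00EB"   # "ë" as a single codepoint (U+00EB)
decomposed = "e\u0308"  # "e" followed by U+0308 COMBINING DIAERESIS

p composed.codepoints    # one codepoint
p decomposed.codepoints  # two codepoints: the plain "e" and the combining mark

# Hypothetical helper, not part of Ruby: decompose, then drop what is not ASCII.
def ascii_fold(str)
  valid = str.scrub("")                         # drop invalid byte sequences
  split = valid.unicode_normalize(:nfd)         # separate letters from their marks
  split.encode("ASCII", undef: :replace, replace: "")
end

p ascii_fold("\u00E1\u00EB\u00E7\u00FC\u00F1")  # accented Latin letters keep their base letter
```

Characters with no ASCII base letter (Greek letters, for example) are still dropped entirely.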

If you decide you'd like to provide some specific replacement values, you could do this:

REPLACEMENTS = { 
  'á' => "a",
  'ë' => 'e',
}

encoding_options = {
  :invalid   => :replace,     # Replace invalid byte sequences
  :replace => "",             # Use a blank for those replacements
  :universal_newline => true, # Always break lines with \n
  # For any character that isn't defined in ASCII, run this
  # code to find out how to replace it
  :fallback => lambda { |char|
    # If no replacement is specified, use an empty string
    REPLACEMENTS.fetch(char, "")
  },
}

# Pass the options as keywords; Ruby 2.7+ warns about a bare trailing hash here
ascii = non_ascii_string.encode(Encoding.find('ASCII'), **encoding_options)
puts ascii.inspect
  #=> "abcaee123ABC"
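Since the question is specifically about smart quotes and ligatures, a replacement table for those might look like the following. This is only a sketch: TYPOGRAPHIC_REPLACEMENTS and asciify are names invented here, and the table is far from complete.

```ruby
# Hypothetical table: extend it with whatever your users actually paste in.
TYPOGRAPHIC_REPLACEMENTS = {
  "\u2018" => "'",   # left single quote
  "\u2019" => "'",   # right single quote / apostrophe
  "\u201C" => '"',   # left double quote
  "\u201D" => '"',   # right double quote
  "\u2026" => "...", # horizontal ellipsis
  "\uFB01" => "fi",  # "fi" ligature
  "\uFB02" => "fl"   # "fl" ligature
}.freeze

def asciify(text)
  text.encode(
    "ASCII",
    invalid: :replace, # replace invalid byte sequences
    replace: "",       # ...with nothing
    # Characters with no ASCII equivalent are looked up in the table;
    # anything not listed there is dropped.
    fallback: ->(char) { TYPOGRAPHIC_REPLACEMENTS.fetch(char, "") }
  )
end

puts asciify("\u201CHi\u201D \uFB01sh\u2026")
```

A replacement may be longer than one character, which is what makes the ligature and ellipsis entries work.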

Update

Some have reported issues with the :universal_newline option. I have seen this intermittently, but haven't been able to track down the cause.

When it happens, I see Encoding::ConverterNotFoundError: code converter not found (universal_newline). However, after some RVM updates, I've just run the script above under the following Ruby versions without problems:

ruby-1.9.2-p290
ruby-1.9.3-p125
ruby-1.9.3-p194
ruby-1.9.3-p362
ruby-2.0.0-preview2
ruby-head (as of 12-31-2012)

Given this, it doesn't appear to be a deprecated feature or even a bug in Ruby. If anyone knows the cause, please comment.

I'm seeing the code converter not found (universal_newline) for ruby-1.9.3-p429 — Robert J Berger, Jun 05 '13 at 01:19
Changing the symbol `:universal_newline` to `:UNIVERSAL_NEWLINE_DECORATOR` fixes the problem for me. — Dex, Aug 12 '13 at 07:09
This helped me a lot, this was the only thing that was working for me! Thanks Nathan! — FastSolutions, May 23 '14 at 08:36
On Ruby 2.5.0 I was experiencing an Encoding::UndefinedConversionError when trying to strip out Unicode characters from text. I fixed this by adding `:undef => :replace,` to the encoding options hash. — Joe Alamo, Mar 05 '19 at 11:44
Update for Ruby 2.7. Add double splat to fix the last argument deprecation warning. `encode(Encoding.find('ASCII'), **encoding_options)`. — danielricecodes, May 14 '21 at 20:56
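Pulling the comments above together, a version that leaves out :universal_newline, includes :undef => :replace, and passes the options as keywords could look like this. It is a sketch, not a confirmed fix for the errors reported above, and to_ascii is a made-up name:

```ruby
# Hypothetical helper: strip everything that cannot be represented in ASCII.
def to_ascii(str)
  str.encode(
    "ASCII",
    invalid: :replace, # invalid byte sequences
    undef:   :replace, # valid characters that ASCII cannot represent
    replace: ""        # replace both with nothing
  )
end

p to_ascii("abc\u00E1\u03BB123")
```

Line endings are left untouched here; if you need them normalized, that can be done separately with gsub.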

score 41 · Answer 2 · edited Feb 28 '23 at 04:27

Ruby 1.9 and later:

class String
 def remove_non_ascii(replacement="") 
   self.gsub(/[\u0080-\uffff]/, replacement)
 end
end

Ruby 1.8:

class String
 def remove_non_ascii(replacement="") 
   self.gsub(/[\x80-\xff]/, replacement)
 end
end

edited Feb 28 '23 at 04:27

rogerdpack

answered Aug 13 '09 at 05:13

klochner

This is the simplest way to create an ASCII projection dropping Unicode characters. It does not create a clean translation and injects multiple replacement chars for a single multi-byte Unicode char. It was the right tool for my job, though. – Winfield Jan 24 '11 at 21:21
In ruby 1.9, I get an exception of "invalid multibyte escape". To fix it, instead of \x80-\xff, I used \u0080-\u00ff – e3matheus Jun 18 '11 at 17:16
. . . but, you need to remove the universal_newline option in ruby build p194 (1.9.3-p194). – klochner Oct 08 '12 at 17:50
Thank you, finally! (but for me it only worked after negation /[^\u0080-\u00ff]/) – dwn Oct 25 '17 at 08:16
This is a really bad solution. There are 0x10FF7F non-ASCII chars. This will work for 0.01% of these, and will not cover those mentioned by the OP. If you want to use gsub, the Regexp should be `/[^\x00-\x7F]/`. – jaynetics Jul 30 '19 at 11:17
Bad solution; it removed a space in some cases too – Rabin Poudyal Feb 01 '21 at 10:14
Removes some characters entirely instead of replacing them with ASCII equivalents (spaces, accented letters) but gets the job done :) – rogerdpack Feb 28 '23 at 15:47
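To illustrate the point raised in the comments, here is a small comparison of the answer's range with the negated ASCII range suggested by jaynetics. The variable names are only for illustration:

```ruby
narrow = /[\u0080-\uffff]/ # the range from the answer: stops at U+FFFF
wide   = /[^\x00-\x7F]/    # anything that is not ASCII

# "é" (U+00E9) is inside the narrow range, the emoji (U+1F600) is above it.
sample = "caf\u00E9 \u{1F600} ok"

p sample.gsub(narrow, "") # the emoji is left in place
p sample.gsub(wide, "")   # both non-ASCII characters are removed
```

Both patterns expect a validly encoded string; gsub raises ArgumentError if the input contains invalid byte sequences.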

score 21 · Answer 3 · answered Aug 13 '09 at 14:27

Here's my suggestion using Iconv.

class String
  def remove_non_ascii
    require 'iconv'
    Iconv.conv('ASCII//IGNORE', 'UTF8', self)
  end
end

answered Aug 13 '09 at 14:27

Scott

This looks like the legitimate way to convert from Unicode to Ascii. – Winfield Jan 24 '11 at 21:22
For followers: Iconv was deprecated in Ruby 1.9+ – rogerdpack Feb 28 '23 at 16:07

score 7 · Answer 4 · edited Feb 28 '23 at 04:33

If you have activesupport, you can use I18n.transliterate:

I18n.transliterate("áëëçüñżλφθΩ")
  # => "aee?cunz?????"

Or, if you don't want the question marks:

I18n.transliterate("áëëçüñżλφθΩ", replacement: "")
  # => "aeecunz"

Note that this doesn't remove invalid byte sequences; it only replaces non-ASCII characters. For my use case, that was exactly what I wanted, and it was simple.
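If invalid byte sequences are a concern as well, one option is to strip them beforehand with String#scrub (Ruby 2.1+). A minimal sketch, where drop_invalid_bytes is a made-up helper name:

```ruby
# Hypothetical helper: remove bytes that are not valid in the string's encoding.
def drop_invalid_bytes(str)
  str.scrub("")
end

# "\xFF" can never appear in valid UTF-8.
dirty   = "caf\u00E9" + "\xFF" + "!"
cleaned = drop_invalid_bytes(dirty)

p dirty.valid_encoding?
p cleaned.valid_encoding?
```

The cleaned string still contains its valid non-ASCII characters, so it can then be handed to a transliteration step.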

edited Feb 28 '23 at 04:33

rogerdpack

answered May 05 '21 at 15:09

GregP

super-helpful newer addition to this thread – jpw Sep 16 '21 at 23:53

boulder_ruby · Answer 5 · 2013-10-28T01:01:26.923

With a bit of help from @masakielastic, I have solved this problem for my own purposes using the String#chars method.

The trick is to break the string down into individual characters so that Ruby can fail on each one separately.

Ruby needs to fail when it confronts binary data and the like. If you don't let Ruby go ahead and fail, it's a tough road when it comes to this stuff. So I use String#chars to break the given string into an array of characters, then pass each character to a sanitizing method that allows "microfailures" (my coinage) within the string.

So, given a "dirty" string, let's say you used File.read on a picture (my case):

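# Define the predicate first so it exists when the filtering code below runs
def num_or_letter?(char)
  if char =~ /[a-zA-Z0-9]/
    true
  elsif char =~ Regexp.union(" ", ".", "?", "-", "+", "/", ",", "(", ")")
    true
  end
end

# File.read opens, reads, and closes the file in one call
dirty = File.read(filepath)
clean_chars = dirty.chars.select do |c|
  begin
    num_or_letter?(c)
  rescue ArgumentError
    # Matching a regexp against an invalid byte sequence raises; drop that character
    next
  end
end
clean = clean_chars.join("")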

score 2 · Answer 6 · answered Oct 16 '14 at 18:53

class String
  def strip_control_characters
    chars.reject { |char| char.ascii_only? && (char.ord < 32 || char.ord == 127) }.join
  end
end

answered Oct 16 '14 at 18:53

Diego Carrion

score 2 · Answer 7 · answered Oct 13 '22 at 13:00

This should do the trick:

ascii_only_str = str.gsub(/[^[:ascii:]]/, '')
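For example, with a caption that contains smart quotes, an accented letter, and a ligature (the sample text is made up):

```ruby
# Smart quotes (U+201C, U+201D), "é" (U+00E9), and the "fi" ligature (U+FB01)
caption = "\u201CCaf\u00E9\u201D \uFB01lter"

ascii_only_str = caption.gsub(/[^[:ascii:]]/, '')
p ascii_only_str
```

Note that the ligature is dropped rather than expanded to "fi", so this approach loses letters as well as punctuation.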

answered Oct 13 '22 at 13:00

Serhii Nadolynskyi

score 0 · Answer 8 · answered Aug 12 '09 at 19:51

A quick Google search turned up a discussion that suggests the following method:

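class String
  # Rebuilds the string in place, swapping every character whose codepoint is
  # below 33 (control characters and space) or above 127 for the replacement.
  def remove_nonascii(replacement)
    n = self.split("")
    self.slice!(0..self.size)
    n.each { |b|
      # b.ord is the character's codepoint (b[0].to_i returns 0 for most
      # characters on Ruby 1.9+, where b[0] is a String)
      if b.ord < 33 || b.ord > 127
        self.concat(replacement)
      else
        self.concat(b)
      end
    }
    self.to_s
  end
end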

answered Aug 12 '09 at 19:51

Joseph Weissman

Yes, I found that but it does not deal with unicode double byte chars right? Well, I will test this one, thanks for the help! – Aug 12 '09 at 19:54

score 0 · Answer 9 · answered Aug 13 '09 at 02:43

No, there isn't, short of removing all characters besides the basic ones (which is recommended above). The best solution would be to handle these names properly, since most filesystems today have no problems with Unicode names. If your users paste in ligatures, they sure as hell will want to get them back too. If the filesystem is your problem, abstract it away and set the filename to some MD5 hash. This also lets you easily shard uploads into buckets, which scan very quickly since they never have too many entries.
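A minimal sketch of that idea, assuming the hash is taken over the original filename (hashing the file contents would work just as well). The method name storage_path, the "uploads" root, and the two-character bucket scheme are all made up for illustration:

```ruby
require "digest"

# Hypothetical helper: map any user-supplied name to an ASCII-only path.
def storage_path(original_name, root = "uploads")
  digest = Digest::MD5.hexdigest(original_name)
  # Keep the extension only if it is plain ASCII
  ext = File.extname(original_name).downcase
  ext = "" unless ext.ascii_only?
  # The first two hex digits pick one of 256 buckets
  File.join(root, digest[0, 2], digest + ext)
end

puts storage_path("r\u00E9sum\u00E9 \u201Cfinal\u201D.JPG")
```

The original caption or filename can then be stored alongside the path (in a database, say) and shown back to the user unchanged.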
