21

It seems to be a very simple and much-needed method. I need to remove all non-ASCII characters from a string, e.g. © etc. See the following example.

#coding: utf-8
s = " Hello this a mixed string © that I made."
puts s.encoding
puts s.encode

output:

UTF-8
Hello this a mixed string © that I made.

When I feed this to Watir, it produces the following error: incompatible character encodings: UTF-8 and ASCII-8BIT

So my problem is that I want to get rid of all non ASCII characters before using it. I will not know which encoding the source string "s" uses.

I have been searching and experimenting for quite some time now.

If I try to use

  puts s.encode('ASCII-8BIT')

It gives the error:

 : "\xC2\xA9" from UTF-8 to ASCII-8BIT (Encoding::UndefinedConversionError)
Nick

3 Answers

43

You can just literally translate what you asked into a Regexp. You wrote:

I want to get rid of all non ASCII characters

We can rephrase that a little bit:

I want to substitute all characters which don't have the ASCII property with nothing

And that's a statement that can be directly expressed in a Regexp:

s.gsub!(/\P{ASCII}/, '')

As an alternative, you could also use String#delete!:

s.delete!("^\u{0000}-\u{007F}")
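For a quick check, here is a minimal sketch of both approaches using the non-destructive `gsub` and `delete`, so the original string is left alone. The method names are made up for illustration, and it assumes the input is a UTF-8 (or otherwise ASCII-compatible) string. Note that the bang variants return `nil` when nothing was changed, so it is safer not to rely on their return value.

```ruby
# Hypothetical helper names, used only for this illustration.

# Remove every character that lacks the ASCII property.
def strip_non_ascii_regexp(str)
  str.gsub(/\P{ASCII}/, '')
end

# Remove every character outside the range U+0000..U+007F.
def strip_non_ascii_delete(str)
  str.delete("^\u{0000}-\u{007F}")
end

s = "Hello this a mixed string \u00A9 that I made."

puts strip_non_ascii_regexp(s)  # Hello this a mixed string  that I made.
puts strip_non_ascii_delete(s)  # Hello this a mixed string  that I made.

# The bang version returns nil if there was nothing to remove.
p "plain".dup.gsub!(/\P{ASCII}/, '')  # nil
```

The © is simply dropped, which is why two spaces remain where it used to be.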
Jörg W Mittag
  • 8
    1000.times { puts "6 out of 5 stars" } -- This saved my bacon Jörg. Thank you for educating me by proxy. – lazyPower Oct 12 '12 at 00:26
  • 1
    for the `{ASCII}` one I get `Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)` on ruby 1.9.3 – Brian Armstrong Jan 17 '13 at 23:42
  • 2
    for ruby 1.9.3 you need to use delete, not the {ASCII} approach – jpw Jul 23 '14 at 01:40
  • This `s.delete!("^\u{0000}-\u{007F}")` saved me from the misery!!! Thank you very so much. – CharlesC Mar 23 '17 at 14:15
2

Strip out the characters using regex. This example is in C# but the regex should be the same: How can you strip non-ASCII characters from a string? (in C#)

Translating it into Ruby using gsub should not be difficult.
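A possible translation, assuming the linked C# answer uses a negated character class covering the ASCII range (I have not copied its exact pattern here). The method name is just for illustration:

```ruby
# Hypothetical helper: replace everything outside 0x00..0x7F with nothing.
def strip_with_char_class(str)
  str.gsub(/[^\x00-\x7F]/, '')
end

puts strip_with_char_class("caf\u00E9 \u00A9 2012")  # caf  2012
```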

Community
sosborn
1

UTF-8 is a variable-length encoding. When a character occupies one byte, its value coincides with 7-bit ASCII. So why don't you just look for bytes with a '1' in the MSB, and then remove both them and their trailers? A byte beginning with '110' will be followed by one additional byte. A byte beginning with '1110' will be followed by two. And a byte beginning with '11110' will be followed by three, the maximum supported by UTF-8.

This is all just off the top of my head. I could be wrong.
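A rough sketch of that idea, working on raw bytes. Every byte of a multi-byte UTF-8 sequence, the lead byte as well as its trailers, has the high bit set, so dropping every byte >= 0x80 removes each sequence as a whole without having to count trailers. The method name is hypothetical, and this assumes the input really is UTF-8 (or another ASCII-compatible encoding); it would mangle something like UTF-16.

```ruby
# Hypothetical helper: keep only bytes whose most significant bit is 0.
def strip_high_bit_bytes(str)
  ascii_bytes = str.bytes.select { |b| b < 0x80 }
  # pack('C*') builds a binary string; the remaining bytes are all 7-bit,
  # so labelling the result as US-ASCII is safe.
  ascii_bytes.pack('C*').force_encoding('US-ASCII')
end

puts strip_high_bit_bytes("a \u00A9 b")  # a  b
```

Because it never looks at the string's declared encoding, this also sidesteps the incompatible-encoding errors, at the price of trusting the assumption above.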

Borealid