How to remove all non-ASCII characters from a string in Ruby

Question

It seems to be a very simple and much-needed method. I need to remove all non-ASCII characters from a string, e.g. © etc. See the following example.

#coding: utf-8
s = " Hello this a mixed string © that I made."
puts s.encoding
puts s.encode

output:

UTF-8
Hello this a mixed string ┬⌐ that I made.

When I feed this to Watir, it produces the following error: incompatible character encodings: UTF-8 and ASCII-8BIT

So my problem is that I want to get rid of all non ASCII characters before using it. I will not know which encoding the source string "s" uses.

I have been searching and experimenting for quite some time now.

If I try to use

  puts s.encode('ASCII-8BIT')

It gives the error:

 : "\xC2\xA9" from UTF-8 to ASCII-8BIT (Encoding::UndefinedConversionError)

score 43 · Answer 1 · answered Jul 08 '10 at 09:07

You can just literally translate what you asked into a Regexp. You wrote:

I want to get rid of all non ASCII characters

We can rephrase that a little bit:

I want to substitute all characters which don't have the ASCII property with nothing

And that's a statement that can be directly expressed in a Regexp:

s.gsub!(/\P{ASCII}/, '')

As an alternative, you could also use String#delete!:

s.delete!("^\u{0000}-\u{007F}")
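Both calls above modify `s` in place and return `nil` when nothing was removed. As a minimal sketch, the non-destructive forms could be wrapped like this; the helper names `strip_non_ascii_regexp` and `strip_non_ascii_delete` are made up for illustration, and the input is assumed to be a valid UTF-8 string:

```ruby
# Hypothetical helpers (names are not from the answer): non-destructive
# versions of the two approaches, returning a new string each time.
def strip_non_ascii_regexp(str)
  # \P{ASCII} matches any character that lacks the ASCII property.
  str.gsub(/\P{ASCII}/, '')
end

def strip_non_ascii_delete(str)
  # "^a-b" means "everything except the range a-b".
  str.delete("^\u{0000}-\u{007F}")
end

s = "Hello this a mixed string \u00A9 that I made."  # \u00A9 is the copyright sign
puts strip_non_ascii_regexp(s)
puts strip_non_ascii_delete(s)
```

Both lines print the sentence with the © removed; the spaces on either side of it remain.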

Jörg W Mittag


1000.times { puts "6 out of 5 stars" } -- This saved my bacon Jörg. Thank you for educating me by proxy. – lazyPower Oct 12 '12 at 00:26

For the `{ASCII}` one I get `Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)` on ruby 1.9.3 – Brian Armstrong Jan 17 '13 at 23:42

For ruby 1.9.3 you need to use delete, not the {ASCII} approach – jpw Jul 23 '14 at 01:40

This `s.delete!("^\u{0000}-\u{007F}")` saved me from the misery!!! Thank you so very much. – CharlesC Mar 23 '17 at 14:15
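The `Encoding::CompatibilityError` reported in the comments involves a string tagged ASCII-8BIT. One possible workaround, sketched here under the assumption of Ruby 2.0 or later (for `String#b`) and an ASCII-compatible source encoding, is to work on a binary copy so that `delete` operates byte by byte. The name `strip_high_bytes` is hypothetical:

```ruby
# Hypothetical helper (not from the thread): operates on a binary copy, so
# the encoding the input string is tagged with no longer matters.
def strip_high_bytes(str)
  # String#b returns a copy tagged ASCII-8BIT; delete then drops every
  # byte outside the range 0x00-0x7F.
  ascii_only = str.b.delete("^\x00-\x7F")
  # Only 7-bit bytes remain, so tagging the result as US-ASCII is safe.
  ascii_only.force_encoding('US-ASCII')
end

binary = "caf\xC3\xA9 \xC2\xA9 2010".b  # UTF-8 bytes tagged as binary
puts strip_high_bytes(binary)
```

This would not be appropriate for encodings such as UTF-16, where ASCII characters are not stored as single bytes.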

score 2 · Answer 2 · edited May 23 '17 at 12:18

Strip out the characters using regex. This example is in C# but the regex should be the same: How can you strip non-ASCII characters from a string? (in C#)

Translating it into Ruby using gsub should not be difficult.
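A rough Ruby translation might look like the following, assuming the linked C# answer uses a negated character class spanning U+0000 to U+007F (that pattern is not quoted here, and the helper name is hypothetical). The input is assumed to be valid UTF-8:

```ruby
# Hypothetical helper: removes every character outside the 7-bit ASCII range
# using a negated character class.
def strip_non_ascii_class(str)
  str.gsub(/[^\x00-\x7F]/, '')
end

puts strip_non_ascii_class("Hello \u00A9 world")  # \u00A9 is the copyright sign
```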

answered Jul 08 '10 at 04:13

sosborn


score 1 · Answer 3 · answered Jul 08 '10 at 04:10

UTF-8 is a variable-length encoding. When a character occupies one byte, its value coincides with 7-bit ASCII. So why don't you just look for bytes with a '1' in the MSB, and then remove both them and their trailers? A byte beginning with '110' will be followed by one additional byte. A byte beginning with '1110' will be followed by two. And a byte beginning with '11110' will be followed by three, the maximum supported by UTF-8.

This is all just off the top of my head. I could be wrong.
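The byte-level idea can be sketched roughly as follows. This is only an illustration of the description above, not a tested library routine; `strip_multibyte_utf8` is a hypothetical name, and the input is assumed to hold UTF-8 bytes:

```ruby
# Hypothetical helper: walks the bytes, keeps 7-bit ones, and drops each
# multi-byte lead byte together with its continuation bytes.
def strip_multibyte_utf8(str)
  kept = []
  trailers = 0  # continuation bytes still expected for the current sequence
  str.each_byte do |byte|
    if trailers > 0 && (byte & 0xC0) == 0x80
      trailers -= 1  # 10xxxxxx: trailer of a sequence being removed
      next
    end
    trailers = 0
    if byte < 0x80
      kept << byte                    # 0xxxxxxx: plain ASCII, keep it
    elsif (byte & 0xE0) == 0xC0
      trailers = 1                    # 110xxxxx: one trailer follows
    elsif (byte & 0xF0) == 0xE0
      trailers = 2                    # 1110xxxx: two trailers follow
    elsif (byte & 0xF8) == 0xF0
      trailers = 3                    # 11110xxx: three trailers follow
    end
    # Any other byte with the high bit set (a stray trailer) is dropped too.
  end
  kept.pack('C*').force_encoding('US-ASCII')
end

puts strip_multibyte_utf8("Hello \u00A9 world")
```

For well-formed UTF-8 the result is the same as simply dropping every byte with the high bit set, because lead bytes and continuation bytes all have it.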
