How to create a string with a "bad encoding" in ruby?

Question

There is a file in production that I do not have access to. When a Ruby script loads it, running a regular expression against its contents fails with an ArgumentError => invalid byte sequence in UTF-8.

I believe I have a fix based on the answer with all the points here: ruby 1.9: invalid byte sequence in UTF-8

# Remove all invalid and undefined characters in the given string
# (ruby 1.9.3)
def safe_str str

  # edited based on matt's comment (thanks matt)
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end

However, I now want to write an RSpec test to verify that the code works. I don't have access to the file that caused the problem, so I want to create a string with the bad encoding programmatically.

I've tried variations on things like:

bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
bad_str.length.should > safe_str(bad_str).length

or,

bad_str = (100..1000).to_a.pack('c*')
bad_str.length.should > safe_str(bad_str).length

but the length is always the same. I have also tried different character ranges; not always 100 to 1000.

Any suggestions on how to build a string with an invalid encoding within a ruby 1.9.3 script?

Do your bad strings trigger the original "invalid byte sequence" exception? Maybe they really are bad, but `safe_str` is not catching it for some reason. — Hew Wolff, Aug 14 '13 at 18:38
Thanks @HewWolff, I haven't deployed it yet. I wanted to get my tests to behave properly (which is a good thing based on matt's comment below). — GSP, Aug 14 '13 at 20:35

score 5 · Answer 1 · answered Aug 14 '13 at 18:33

Lots of one-byte strings will make an invalid UTF-8 string, starting with 0x80. So 128.chr should work.
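One caveat worth checking: on a default setup, `128.chr` returns a binary (ASCII-8BIT) string, and a binary string counts as valid whatever bytes it holds. It only becomes an invalid UTF-8 string once it is relabelled. A minimal sketch, assuming `Encoding.default_internal` is unset (the default for a plain script):

```ruby
# Assumes Encoding.default_internal is nil, so 128.chr yields a binary string.
raw = 128.chr                              # single byte 0x80, tagged as binary
forced = raw.dup.force_encoding('UTF-8')   # same byte, now labelled UTF-8

raw_valid = raw.valid_encoding?            # binary strings accept any byte
forced_valid = forced.valid_encoding?      # 0x80 is a lone continuation byte in UTF-8

puts "binary valid: #{raw_valid}, utf-8 valid: #{forced_valid}"
```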

Hew Wolff

Thank you. I forgot about the `chr` method on `Integer`. – GSP Aug 14 '13 at 20:37

score 3 · Accepted Answer · edited May 23 '17 at 10:28

Your safe_str method will (currently) never actually do anything to the string; it is a no-op. The docs for String#encode on Ruby 1.9.3 say:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

This is still true for the current release, 2.0.0 (patch level 247). However, a recent commit to Ruby trunk changes this and also introduces a scrub method that does pretty much what you want.
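For readers on a Ruby that ships it (String#scrub is available from Ruby 2.1), here is a minimal sketch of what scrub does. The sample string is made up for illustration:

```ruby
# Requires Ruby 2.1 or later for String#scrub.
# 0xFF can never appear in well-formed UTF-8, so this string is invalid.
bad = "abc\xFFdef".dup.force_encoding('UTF-8')

# Passing '' drops invalid bytes instead of substituting U+FFFD.
cleaned = bad.scrub('')

puts bad.valid_encoding?
puts cleaned
puts cleaned.valid_encoding?
```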

Until a new version of Ruby is released, you will need to round-trip your string to another encoding and back to clean it, as in the second example in this answer to the question you linked to. Something like:

def safe_str str
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end

Note that your first example of an attempt to create an invalid string won’t work:

bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
bad_str.valid_encoding? # => true

From the << docs:

If the object is a Integer, it is considered as a codepoint, and is converted to a character before concatenation.

So you’ll always get a valid string.

Your second method, using pack, will create a string with the encoding ASCII-8BIT. If you then relabel it using force_encoding, you can create a UTF-8 string with an invalid encoding:

bad_str = (100..1000).to_a.pack('c*').force_encoding('utf-8')
bad_str.valid_encoding? # => false
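Putting the pieces together, the length check from the question could be sketched in plain Ruby as follows. It deliberately avoids RSpec so that it runs standalone, and safe_str is the round-trip version shown above:

```ruby
# Round-trip through UTF-16 so that encode actually inspects the bytes.
def safe_str(str)
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end

# pack('c*') keeps only the low 8 bits of each integer and yields a binary string;
# force_encoding relabels those bytes as UTF-8 without checking them.
bad_str = (100..1000).to_a.pack('c*').force_encoding('utf-8')
clean = safe_str(bad_str)

puts bad_str.valid_encoding?
puts clean.valid_encoding?
puts bad_str.length > clean.length
```

In an RSpec example the last three lines would become expectations, for instance on `valid_encoding?` before and after the call.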

Thank you. It was the `force_encoding` step that was consistently throwing me off. — GSP, Aug 14 '13 at 20:36
P.S. I wish I had access to ruby 2. The `scrub` method is exactly what I need. — GSP, Aug 14 '13 at 20:40

score 3 · Answer 3 · answered Jul 23 '20 at 09:33

Try with:

s = "hi \255"

s.valid_encoding?
# => false
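Why this works: `\255` is an octal escape for the single byte 0xAD, which is a UTF-8 continuation byte and cannot stand on its own. A quick check, assuming the script's source encoding is UTF-8 (the default since Ruby 2.0):

```ruby
# Assumes a UTF-8 source file, so the literal is tagged as UTF-8.
s = "hi \255"               # \255 is octal for byte 0xAD (decimal 173)
last_byte = s.bytes.last

puts last_byte.to_s(16)     # hex form of the trailing byte
puts s.encoding
puts s.valid_encoding?
```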

Iwan B.

score 1 · Answer 4 · answered Feb 01 '22 at 07:23

The following example can be used for testing purposes:

describe TestClass do
  let(:non_utf8_text) { "something\255 english." }

  it 'does not raise an error on a string with an invalid byte sequence' do
    expect(non_utf8_text).not_to be_valid_encoding
    expect { subject.call(non_utf8_text) }.not_to raise_error
  end
end

Thanks to Iwan B. for the "\255" advice.

score 0 · Answer 5 · answered Aug 14 '13 at 18:58

In spec tests I’ve written, I haven’t found a way to fix this bad encoding:

Period%Basics

The %B string consistently produces ArgumentError: invalid byte sequence in UTF-8.

parhamr

I'm not sure what you mean by the '%'? Are you using that to indicate a hex value like `"Period\xBasics"`? – GSP Aug 14 '13 at 20:38
I haven’t even bothered to check what the value maps to or what it represents. I became aware of the string by checking Airbrake exceptions and the bad string came from a GET param. I tried a variety of ways to catch or fix the exception, but didn’t have luck. – parhamr Aug 14 '13 at 22:10
