
In Ruby (2.4), I can create a string whose encoding is UTF-8 but which contains a byte invalid in UTF-8 (let's use the byte E1).

Then when I try to match a regex against this string, I get an error.

2.4.0 :001 > "Hi! \xE1".match?(//)
ArgumentError: invalid byte sequence in UTF-8
        from (irb):1:in `match?'
        from (irb):1

When I do the same thing in Python 3, I do not get an error.

>>> import re; re.match('', "Hi! \xE1")
<_sre.SRE_Match object; span=(0, 0), match=''>

My understanding is that, in both cases, I am in a state of sin because I am creating UTF-8-encoded strings that contain bytes invalid in UTF-8. Given that:

  • Is it specifically regex comparisons that fail in Ruby, and not other operations? If so, why?
  • What accounts for the difference between Ruby and Python here?
  • Is it possible to get Python to give an error of this type? (Without interacting with external resources -- I know this can happen in the context of connecting to a database, for example.)
Eli Rose
  • I think the error is raised by Ruby's regex engine. And Python probably uses a different engine. I don't know Python but it certainly can check whether a string is valid UTF-8. – Stefan Jan 21 '18 at 11:08
  • I cannot reproduce Ruby crashing. That would be a bug. It should raise an exception instead, but not crash. You should file a bug. – Jörg W Mittag Jan 21 '18 at 19:01
  • Hi @JörgWMittag -- sorry, I was using 'crash' imprecisely. I meant raising an exception. I edited the question. – Eli Rose Jan 21 '18 at 20:47

1 Answer


In Ruby, creating a string with quotes (e.g. 'Hi!') creates an instance of the core String class. As you noted, in 2.0 or later, Ruby defaults to interpreting strings in source files as UTF-8. When you call a method on the string instance, Ruby uses that encoding to interpret the bytes that make up the string, so to answer your first question: it's not specific to regex matches. You would see the same error from gsub, split, or any other method that has to interpret the bytes as characters.

As this post helpfully details, Python 3 also treats strings as Unicode; however, a Python 3 str is a sequence of Unicode code points, not bytes (internally, CPython stores them using 1, 2, or 4 bytes per code point, per PEP 393). The crucial difference is what the \xE1 escape means: in a Ruby double-quoted literal it inserts the raw byte 0xE1, while in a Python str literal it denotes the code point U+00E1 ('á'). Your Python string therefore never contains an invalid byte sequence at all, and there is nothing for the regex engine to reject.
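You can check this directly in Python 3: the string has five code points, and encoding its last character to UTF-8 yields a perfectly valid two-byte sequence:

```python
# In Python 3, "\xE1" denotes the code point U+00E1 ("á"), not the raw byte 0xE1.
s = "Hi! \xE1"
print(len(s))                 # 5 -- five code points
print(s[-1])                  # á
print(s[-1].encode("utf-8"))  # b'\xc3\xa1' -- a valid two-byte UTF-8 sequence
```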

Interestingly enough, a code point like U+FFFF is perfectly acceptable to Python (it is a Unicode "noncharacter", but still a legal code point), so no error is raised; the interpreter simply echoes it back in escaped form because it is not printable:

>>> '\uffff'
'\uffff'
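That escaped echo is just the repr of a valid one-character string, as a quick check shows:

```python
# U+FFFF is a legal (if unusual) code point; repr escapes it only
# because it is not printable.
s = "\uffff"
print(len(s), hex(ord(s)))  # 1 0xffff
print(s.isprintable())      # False -- hence the escaped display
```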

whereas an escape that is not even well-formed fails at parse time: \x must be followed by exactly two hex digits, and z is not a hex digit, so you get a SyntaxError rather than an invalid byte:

>>> "Hi! \xz1"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in 
position 4-5: truncated \xXX escape
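As for the third question in the post: Python does raise the analogous error once you start from raw bytes, which is effectively what Ruby's "\xE1" literal gives you. A minimal sketch -- decoding bytes that are not valid UTF-8 raises UnicodeDecodeError:

```python
# Starting from bytes (as Ruby's "\xE1" literal effectively does),
# decoding them as UTF-8 raises the analogous error.
raw = b"Hi! \xE1"
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xe1 in position 4: ...
```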
Zach Schneider
  • I found the following blog post to be informative: [Testing Ruby's Unicode Support](http://blog.honeybadger.io/ruby-s-unicode-support/). – Mark Thomas Jan 21 '18 at 19:32
  • Hmm, interesting, and you're right -- I see the same error on other string methods. I'm confused by your use of `"\xz1"` however -- `z1` is not valid hex notation, so what byte is that referring to? To me it seems like the error is saying you have an invalid escape code, not that there are some invalid bytes in this string. – Eli Rose Jan 21 '18 at 21:32
  • Good point, I updated my answer. Looks like it actually leaves non-unicode hex notation as plaintext. I spend the vast majority of my time in ruby rather than python so I'm not sure why that is the case. – Zach Schneider Jan 22 '18 at 01:03
  • [ruby]: Not quite "any other string method": `String#+` does not interpret the bytes, it'll just blindly concatenate as long as encodings are compatible. E.g. `"\xF0" + "\x9F" + "\x98" + "\xBC"` produces `"😼"` without any errors. – Amadan Jan 22 '18 at 01:13