In Ruby (2.4), I can create a string whose encoding is UTF-8 but which contains a byte invalid in UTF-8 (let's use the byte E1).
Then when I try to match a regex against this string, I get an error.
2.4.0 :001 > "Hi! \xE1".match?(//)
ArgumentError: invalid byte sequence in UTF-8
from (irb):1:in `match?'
from (irb):1
When I do the same thing in Python 3, I do not get an error.
>>> import re; re.match('', "Hi! \xE1")
<_sre.SRE_Match object; span=(0, 0), match=''>
My understanding is that, in both cases, I am in a state of sin because I am creating UTF-8-encoded strings that contain bytes invalid in UTF-8. Given that:
- Is it specifically regex comparisons that fail in Ruby, and not other operations? If so, why?
- What accounts for the difference between Ruby and Python here?
- Is it possible to get Python to give an error of this type? (Without interacting with external resources -- I know this can happen in the context of connecting to a database, for example.)
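
For reference, here is a minimal sketch of what I would count as the analogous situation in Python 3. It rests on my assumption that `"\xE1"` in a `str` literal is the code point U+00E1 rather than a raw byte, so the closest equivalent to the Ruby string is a `bytes` object decoded as UTF-8. The variable names are only illustrative.

```python
# Assumption: in a Python 3 str literal, "\xE1" is the code point U+00E1,
# not a raw E1 byte, so the str itself is perfectly valid text.
text = "Hi! \xE1"
code_point = ord(text[-1])         # numeric value of the last character
utf8_bytes = text.encode("utf-8")  # U+00E1 becomes the two bytes C3 A1

# A raw E1 byte can only live in a bytes object. Decoding it as UTF-8
# is where Python complains about the invalid sequence.
raw = b"Hi! \xE1"
try:
    raw.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(hex(code_point), utf8_bytes, type(decode_error).__name__)
```

If that assumption holds, the Python snippet in my question never contained an invalid byte in the first place, which may be part of the answer to the second bullet.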