Making character-range Regexp work with Ruby 1.9

Question

These operations work in Ruby 1.8, but I can't make them work in Ruby 1.9:

irb(main):002:0> "Café".match(/[\x80-\xff]/)
SyntaxError: (irb):2: invalid multibyte escape: /[\x80-\xff]/

irb(main):003:0> "Café".match(Regexp.new('[\x80-\xff]', nil, 'n'))
Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)

How can I fix this?

You could use `/[^\p{ASCII}]/`, which matches anything not in `/[\x00-\x7F]/`. [Example](http://rubular.com/r/sXlJACAwdS) – engineersmnky, May 13 '15 at 20:03
What is it you're trying to do? You could do this: `"Café".force_encoding('binary').match(/[\x80-\xff]/n)`. At least it doesn't raise any exceptions, but it doesn't really make much sense with a Unicode string. – matt, May 13 '15 at 20:22
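
The byte-oriented workaround from the comment above can be sketched roughly as follows. This is a minimal illustration assuming Ruby 1.9 or later; the variable names are made up here and are not from the thread. Since é is encoded as the two bytes C3 A9 in UTF-8, the regexp matches a single byte of that sequence rather than the character, which is why the approach is of limited use on Unicode text.

```ruby
# encoding: utf-8

word = "Café"

# Work on a binary (ASCII-8BIT) copy so the byte-range regexp is compatible.
# dup avoids mutating the original string (or a frozen literal).
bytes_view = word.dup.force_encoding(Encoding::BINARY)

# The /n flag makes the regexp ASCII-8BIT, so \x80-\xff denote single bytes.
high_byte = bytes_view.match(/[\x80-\xff]/n)

# Print the matched byte in hex rather than as a raw, unprintable byte.
puts high_byte[0].bytes.map { |b| format("%02X", b) }.join(" ")
```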

Wiktor Stribiżew · Accepted Answer · 2015-05-13T20:48:19.797

If you want to match a range expressed as code points, use `\u` notation together with the UTF-8 encoding header:

#!/usr/bin/env ruby
# encoding: utf-8

puts "Café".match(/[\u0080-\uFFFF]/)

The output of the demo program is é.
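
For comparison, here is a small sketch of the code-point range next to the shorthand classes mentioned in the comments on this thread. It assumes Ruby 1.9 or later with a UTF-8 source file; the variable names and the narrower subrange are illustrative choices, not part of the original answer.

```ruby
# encoding: utf-8

word = "Café"

# Explicit code-point range, as in the answer above.
by_range = word[/[\u0080-\uFFFF]/]

# An explicit range can be narrowed, e.g. to part of the Latin-1 Supplement
# block (illustrative subrange; é is U+00E9).
by_subrange = word[/[\u00C0-\u00FF]/]

# Shorthand classes for "anything that is not ASCII".
by_property = word[/[^\p{ASCII}]/]
by_posix    = word[/[^[:ascii:]]/]

p [by_range, by_subrange, by_property, by_posix]
```

The explicit range is the only form here that can be adjusted to cover a different set of code points; the two shorthand forms always mean "not ASCII".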

edited May 13 '15 at 20:48

answered May 13 '15 at 20:32

what would be the difference between your example and say `/[^[:ascii:]]/` or `/[^\p{ASCII}]/` just for edification purposes? – engineersmnky May 13 '15 at 20:57
I asked myself that question. This way we can modify the range, for instance to use a subrange. `[^\p{ASCII}]` is a set shorthand class. – Wiktor Stribiżew May 13 '15 at 20:59
Fair enough, this would allow for range manipulation, but in essence what you have currently is the same, as it contains the full range of non-ASCII characters. Because your answer allows for more flexibility, I am inclined to agree that this is the best answer to an ambiguous question. – engineersmnky May 13 '15 at 21:03
Is there a method that would work in both Ruby 1.8 and Ruby 1.9? – benji May 13 '15 at 21:22
I think this one will work. If you plan to only match non-ASCII, you can use the already discussed `[^\p{ASCII}]` class. – Wiktor Stribiżew May 13 '15 at 21:25
somehow in 1.8 it matches "a" – benji May 13 '15 at 21:29
Did you add `u` to the regex: `.match(/[\u0080-\uFFFF]/u)`, and did you declare the UTF-8 encoding? – Wiktor Stribiżew May 13 '15 at 21:31
With 1.8 (`# encoding: utf-8`): `puts "Café".match(/[\u0080-\uFFFF]/)` => C; `puts "Café".match(/[\x80-\xff]/)` => nothing; `puts "Café".match(/[^\p{ASCII}]/)` => a; `puts "Café".match(/[\u0080-\uFFFF]/u)` => C – benji May 13 '15 at 21:34
Interesting. Have a look at [this post](http://stackoverflow.com/a/4585339/3832970), I hope it can help. – Wiktor Stribiżew May 13 '15 at 21:37
@bananasplit The inverse of the answer works well across versions, e.g. `/[^\u0000-\u007F]/u` will select é in 1.8.7, 1.9.3, and 2.1.5. – engineersmnky May 14 '15 at 12:08
