
These operations work in Ruby 1.8, but I can't make them work in Ruby 1.9:

irb(main):002:0> "Café".match(/[\x80-\xff]/)
SyntaxError: (irb):2: invalid multibyte escape: /[\x80-\xff]/

irb(main):003:0> "Café".match(Regexp.new('[\x80-\xff]', nil, 'n'))
Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)

How can I fix this?

Wiktor Stribiżew
benji
  • you could use this `/[^\p{ASCII}]/` which will match anything not in `/[\x00-\x7F]/` [Example](http://rubular.com/r/sXlJACAwdS) – engineersmnky May 13 '15 at 20:03
  • What is it you’re trying to do? You could do this: `"Café".force_encoding('binary').match(/[\x80-\xff]/n)` – at least it doesn’t raise any exceptions, but it doesn’t really make much sense with a unicode string. – matt May 13 '15 at 20:22
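The byte-oriented approach from the last comment can be sketched as follows. This is a minimal illustration, assuming Ruby 1.9 or later and a UTF-8 source file; the variable names `text` and `bytes_view` are made up for the example. It works on a copy so the original string keeps its encoding, and it shows why the result is of limited use on Unicode text: the match is a single byte, not the whole character.

```ruby
# encoding: utf-8
# Sketch only: assumes Ruby 1.9+ and a UTF-8 source file.

text = "Café"

# Reinterpret a copy of the string as raw bytes; the original stays UTF-8.
bytes_view = text.dup.force_encoding(Encoding::BINARY)

# The /n flag makes the regexp ASCII-8BIT, so \x80-\xff is a valid byte range.
m = bytes_view.match(/[\x80-\xff]/n)

# The match is the first high byte of the UTF-8 sequence for "é",
# not the full two-byte character.
puts m[0].bytes.inspect
puts text.encoding
```

Because the match lands in the middle of a multibyte sequence, this is mostly useful for detecting that non-ASCII bytes are present, not for extracting characters.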

1 Answer


If you want to match a range expressed as Unicode code points, use \u notation and declare the source encoding with the UTF-8 magic comment:

#!/usr/bin/env ruby
# encoding: utf-8

puts "Café".match(/[\u0080-\uFFFF]/)

The output of the demo program is é.
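Since the point of writing the class as a code point range is that the range can be adjusted, here is a small sketch of that idea. It assumes Ruby 1.9 or later with a UTF-8 source file; the narrower Latin-1 Supplement subrange (U+0080 to U+00FF) and the sample string with a euro sign are illustrative choices, not part of the original answer.

```ruby
# encoding: utf-8
# Sketch only: assumes Ruby 1.9+ and a UTF-8 source file.

text = "Café"

# Full range from the answer: U+0080 through U+FFFF.
non_ascii = text[/[\u0080-\uFFFF]/]

# Hypothetical narrower subrange: Latin-1 Supplement only (U+0080..U+00FF).
latin1 = text[/[\u0080-\u00FF]/]

# "€" is U+20AC, which lies outside the narrower subrange,
# so String#[] returns nil here.
euro = "5 €"[/[\u0080-\u00FF]/]

puts non_ascii.inspect
puts latin1.inspect
puts euro.inspect
```

Both ranges find the é in "Café", while the narrower one skips the euro sign, which is the kind of control a fixed shorthand class does not offer.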

Wiktor Stribiżew
  • what would be the difference between your example and say `/[^[:ascii:]]/` or `/[^\p{ASCII}]/` just for edification purposes? – engineersmnky May 13 '15 at 20:57
  • I asked myself the same question. Written this way, the range can be modified, for instance narrowed to a subrange. `[^\p{ASCII}]` is a shorthand class for a fixed set. – Wiktor Stribiżew May 13 '15 at 20:59
  • Fair enough, this would allow for range manipulation, but in essence what you have currently is the same, since it contains the full range of non-ASCII characters. Because your answer allows for more flexibility, I am inclined to agree that this is the best answer to an ambiguous question. – engineersmnky May 13 '15 at 21:03
  • Is there a method that would work in both ruby 1.8 and ruby 1.9? – benji May 13 '15 at 21:22
  • I think this one will work. If you plan to only match non-ASCII, you can use the already discussed `[^\p{ASCII}]` class. – Wiktor Stribiżew May 13 '15 at 21:25
  • somehow in 1.8 it matches "a" – benji May 13 '15 at 21:29
  • Did you add `u` to the regex: `.match(/[\u0080-\uFFFF]/u)`, and did you declare the UTF-8 encoding? – Wiktor Stribiżew May 13 '15 at 21:31
  • with 1.8 and `# encoding: utf-8`: `puts "Café".match(/[\u0080-\uFFFF]/)` => C; `puts "Café".match(/[\x80-\xff]/)` => nothing; `puts "Café".match(/[^\p{ASCII}]/)` => a; `puts "Café".match(/[\u0080-\uFFFF]/u)` => C – benji May 13 '15 at 21:34
  • Interesting. Have a look at [this post](http://stackoverflow.com/a/4585339/3832970), I hope it can help. – Wiktor Stribiżew May 13 '15 at 21:37
  • @bananasplit the inverse of the answer works well across versions, e.g. `/[^\u0000-\u007F]/u` will select é in 1.8.7, 1.9.3, and 2.1.5 – engineersmnky May 14 '15 at 12:08
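For Ruby 1.9 and later, the three non-ASCII patterns mentioned in the comments can be compared side by side. This is a minimal sketch assuming a UTF-8 source file; the hash labels are illustrative only, and it makes no claim about how each pattern behaves on 1.8.

```ruby
# encoding: utf-8
# Sketch only: assumes Ruby 1.9+ and a UTF-8 source file.

text = "Café"

# Three ways to say "any character outside the ASCII range".
patterns = {
  "unicode property" => /[^\p{ASCII}]/,
  "POSIX bracket"    => /[^[:ascii:]]/,
  "negated range"    => /[^\u0000-\u007F]/u
}

# String#[] with a regexp returns the first matched substring, or nil.
results = patterns.map { |label, re| [label, text[re]] }

results.each { |label, hit| puts "#{label}: #{hit.inspect}" }
```

On these versions all three select the same character from "Café"; the range form is the only one whose bounds can be changed.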