Split Unicode entities by graphemes

Question

"d̪".chars.to_a

gives me

["d"," ̪"]

How do I get Ruby to split it by graphemes?

["d̪"]

You want to split at graphemes? – Joey, Oct 22 '12 at 18:57

Inkling · Accepted Answer · 2019-06-15T05:15:30.750

Edit: As @michau's answer notes, Ruby 2.5 introduced the String#grapheme_clusters method, as well as String#each_grapheme_cluster if you just want to iterate or enumerate without creating an array.
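
For illustration, here is a minimal sketch of the iterating form, assuming Ruby 2.5 or later. The sample string and the variable names are made up for this example; the string is written with \u escapes so that the combining mark is visible in the source.

```ruby
# Assumes Ruby 2.5+, where String#each_grapheme_cluster is available.
str = "d\u032Ad\u032Aa"   # "d" + COMBINING BRIDGE BELOW, twice, then a plain "a"

# Block form: each cluster is yielded in turn.
codepoints_per_cluster = []
str.each_grapheme_cluster { |g| codepoints_per_cluster << g.length }

# Blockless form: returns an Enumerator, so Enumerable methods such as first apply.
first_two = str.each_grapheme_cluster.first(2)

p codepoints_per_cluster
p first_two
```

The block form is the one to reach for when the clusters are only consumed once; the Enumerator form is handy when only the first few clusters are needed.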

In Ruby 2.0 or above, you can use str.scan(/\X/):

> "d̪".scan /\X/
=> ["d̪"]
> "d̪d̪d̪".scan /\X/
=> ["d̪", "d̪", "d̪"]

# Let's get crazy:


> str = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'


> str.length
=> 75
> str.scan(/\X/).length
=> 6
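
One place where the difference between length and the \X count matters is truncation. The following is only a sketch: truncate_graphemes is a hypothetical helper written for this example, not something Ruby provides.

```ruby
# Hypothetical helper (not part of Ruby): keep at most n grapheme clusters.
def truncate_graphemes(str, n)
  str.scan(/\X/).first(n).join
end

s = "d\u032Aa\u0301b"   # d + bridge below, a + acute accent, b

by_codepoint = s[0, 1]                    # slices by codepoint, so the mark is cut off
by_grapheme  = truncate_graphemes(s, 1)   # keeps the base letter together with its mark

p by_codepoint.length
p by_grapheme.length
```

Slicing by codepoint leaves a bare "d", while the \X-based helper keeps the mark attached to its base letter.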

If you want to match the grapheme boundaries for any reason, you can use (?=\X) in your regex, for instance:

> "d̪".split /(?=\X)/
=> ["d̪"]

ActiveSupport (which is included in Rails) also has a way if you can't use \X for some reason:

ActiveSupport::Multibyte::Unicode.unpack_graphemes("d̪").map { |codes| codes.pack("U*") }
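
The ActiveSupport call above works on arrays of codepoints, one array per grapheme, which is why each one is packed back into a string with pack("U*"). As a rough plain-Ruby sketch of the same two steps (this assumes \X is available and does not use ActiveSupport at all; the variable names are illustrative):

```ruby
str = "d\u032Aa"   # d + COMBINING BRIDGE BELOW, then a plain "a"

# Step 1: one array of codepoints per grapheme cluster.
codes = str.scan(/\X/).map(&:codepoints)

# Step 2: pack each codepoint array back into a UTF-8 string.
graphemes = codes.map { |c| c.pack("U*") }

p codes
p graphemes
```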

score 2 · Answer 2 · edited Jun 14 '19 at 14:00

The following code should work in Ruby 2.5 and later:

"d̪".grapheme_clusters # => ["d̪"]

answered Jun 14 '19 at 13:03 by kxmh42 · edited Jun 14 '19 at 14:00 by Malekai

score 1 · Answer 3 · answered Oct 22 '12 at 20:10

Use Unicode::text_elements from the unicode gem, which is documented at http://www.yoshidam.net/unicode.txt.

irb(main):001:0> require 'unicode'
=> true
irb(main):006:0> s = "abčd̪é"
=> "abčd̪é"
irb(main):007:0> s.chars.to_a
=> ["a", "b", "č", "d", "̪", "é"]
irb(main):009:0> Unicode.nfc(s).chars.to_a
=> ["a", "b", "č", "d", "̪", "é"]
irb(main):010:0> Unicode.nfd(s).chars.to_a
=> ["a", "b", "c", "̌", "d", "̪", "e", "́"]
irb(main):017:0> Unicode.text_elements(s)
=> ["a", "b", "č", "d̪", "é"]

Not everything can be normalised, so it's safer to use `s.scan(/\X/)` or `s.grapheme_clusters` instead. — kxmh42, Jun 14 '19 at 13:10
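
To make the comment above concrete, here is a small sketch that uses Ruby's built-in String#unicode_normalize (Ruby 2.2+) in place of the unicode gem from this answer. NFC can compose "e" plus an acute accent into one codepoint, but there is no precomposed character for "d" plus the bridge below, so that pair stays as two codepoints and only \X keeps it together.

```ruby
e_acute  = "e\u0301"   # e + COMBINING ACUTE ACCENT: has a precomposed form
d_bridge = "d\u032A"   # d + COMBINING BRIDGE BELOW: has no precomposed form

e_nfc = e_acute.unicode_normalize(:nfc)
d_nfc = d_bridge.unicode_normalize(:nfc)

p e_nfc.chars.length         # codepoints after NFC
p d_nfc.chars.length         # still split by chars
p d_nfc.scan(/\X/).length    # one grapheme cluster
```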

score -1 · Answer 4 · answered Aug 09 '13 at 08:09

In Ruby 2.0:

   str = "d̪"

   char = str[/\p{M}/]   # first combining mark in str

   other = str[/\w/]     # first word character in str, here the base letter "d"
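
Note that str[...] returns only the first match in the whole string. If the goal is a base letter and its marks for every grapheme, one possible sketch combines \X with the same \p{M} class; the variable names here are illustrative only.

```ruby
str = "d\u032Aa\u0301b"   # d + bridge below, a + acute accent, b

# For each grapheme cluster, pair the first non-mark character with its marks.
parts = str.scan(/\X/).map do |g|
  [g[/\P{M}/], g.scan(/\p{M}/)]
end

p parts
```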

answered Aug 09 '13 at 08:09 by user757123
