A grapheme is a unit of writing, generally smaller than a word. In an ideographic language, a single grapheme may carry considerable meaning, but many languages use only a smaller alphabet where a few different graphemes are arranged in various ways to build units of meaning.
Questions tagged [grapheme]
31 questions
19
votes
3 answers
How to count grapheme clusters or "perceived" emoji characters in Java
I'm looking to count the number of perceived emoji characters in a provided Java string. I'm currently using the emoji4j library, but it doesn't work for grapheme clusters like this one:
Calling EmojiUtil.getLength("") returns 4 instead of 1,…

Craig Otis
- 31,257
- 32
- 136
- 234
15
votes
3 answers
What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode?
What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode?
They seem to do the same thing, as far as I can tell – although the set of grapheme extenders is larger than the set of combining characters. I’m clearly…

Mathias Bynens
- 144,855
- 52
- 216
- 248
10
votes
5 answers
Get grapheme character count in javascript strings?
I'm trying to get the length of a javascript string in user-visible graphemes, ie ignoring combining characters (and surrogate pairs?). Is this possible, and if so, how would I go about it?
We're using the dojo toolkit on our project, but any…

Angus
- 123
- 1
- 8
7
votes
4 answers
What is the right way to get a grapheme?
Why does this print a U and not a Ü?
#!/usr/bin/env perl
use warnings;
use 5.014;
use utf8;
binmode STDOUT, ':utf8';
use charnames qw(:full);
my $string = "\N{LATIN CAPITAL LETTER U}\N{COMBINING DIAERESIS}";
while ( $string =~ /(\X)/g ) {
…

sid_com
- 24,137
- 26
- 96
- 187
5
votes
1 answer
Why is Swift counting this Grapheme Cluster as two characters instead of one?
Generally Swift is really smart about counting grapheme clusters as a single character. If I want to make a Lebanese flag, for example, I can combine the two Unicode characters
U+1F1F1 REGIONAL INDICATOR SYMBOL LETTER L
U+1F1E7 REGIONAL INDICATOR…

Ray Toal
- 86,166
- 18
- 182
- 232
4
votes
1 answer
PostgreSQL pattern matching with Unicode graphemes
Is there any way to pattern match with Unicode graphemes?
As a quick example, when I run this query:
CREATE TABLE test (
id SERIAL NOT NULL,
name VARCHAR NOT NULL,
PRIMARY KEY (id),
UNIQUE (name)
);
INSERT INTO test (name) VALUES…

TechnoSam
- 578
- 1
- 8
- 23
3
votes
2 answers
In Python 3, count Thai character positions
FIRST, I've used the Python 3 grapheme library to solve my problem. (For a bit more about grapheme, see this article). But I'm surprised that Python 3 couldn't do this without a specialized library...
I resorted to grapheme because after many web…

RBV
- 1,367
- 1
- 14
- 33
3
votes
1 answer
Swift String.Index vs transforming the String to an Array
In the swift doc, they say they use String.Index to index strings, as different characters can take a different amount of memory.
But I saw a lot of people transforming a String into an array var a = Array(s) so they can index by int instead of…
user509981
3
votes
1 answer
How does the SQL length function handle unicode graphemes?
Consider the following scenario where I have the string É defined by \U00000045\U00000301.
1) https://www.fileformat.info/info/unicode/char/0045/index.htm
2) https://www.fileformat.info/info/unicode/char/0301/index.htm
Would a table constrained by…

AlanSTACK
- 5,525
- 3
- 40
- 99
3
votes
4 answers
How to check if a UTF-8 string starts with an 'a'
I have a UTF-8 string given as a null-terminated const char*. I would like to know if the first letter of this string is an a by itself. The following code
bool f(const char* s) {
return s[0] == 'a';
}
is wrong, as the first letter (grapheme…

InsideLoop
- 6,063
- 2
- 28
- 55
3
votes
4 answers
Split Unicode entities by graphemes
"d̪".chars.to_a
gives me
["d"," ̪"]
How do I get Ruby to split it by graphemes?
["d̪"]

Reactormonk
- 21,472
- 14
- 74
- 123
2
votes
0 answers
How to split Devanagari bi-tri and tetra conjunct consonants as a whole from a string?
I am new to Rust and I was trying to split Devanagari (vowels and) bi-tri and tetra conjuncts consonants as whole while keeping the vowel sign and virama. and later map them with other Indic script. I first tried using Rust's chars() which didn't…

InsParbo
- 390
- 2
- 13
2
votes
1 answer
C#'s StringInfo and TextElementEnumerator can't recognize graphemes properly
In C# StringInfo and TextElementEnumerator classes provide methods and properties for text elements.
And here, we can find the definition of the Text Element.
The .NET Framework defines a text element as a unit of text that is
displayed as a…

Jenix
- 2,996
- 2
- 29
- 58
2
votes
2 answers
Given a list of Unicode code points, how does one split them into a list of Unicode characters?
I'm writing a lexical analyzer for Unicode text. Many Unicode characters require multiple code points (even after canonical composition). For example, tuple(map(ord, unicodedata.normalize('NFC', 'ā́'))) evaluates to (257, 769). How can I know where…

Tyler Crompton
- 12,284
- 14
- 65
- 94
2
votes
2 answers
How to map Arabic letters to phonemes in Python?
I want to make a simple Python script that will map each Arabic letter to phoneme sound symbols. I have a file that has a bunch of words that the script will read to convert them to phonemes, and I have the following dictionary in my code:
Content…

0x01Brain
- 798
- 2
- 12
- 28