1

Postgres (v11) counts the red heart ❤️ as two characters, and so on for other multibyte UTF-8 chars with selector units. Anyone know how I get postgres to count true characters and not the bytes?

For example, I would like both of the examples below should return 1.

select length('❤️') = 2 (Unicode: 2764 FE0F)

select length('‍♂️') = 4 (Unicode: 1F3C3 200D 2642 FE0F)

UPDATE

Thank you to folks pointing out that postgres is correctly counting the Unicode code points and why and how this happens.

I don't see any other option other than pre-processing the emoji strings as bytes against a table of official Unicode character bytes, in Python or some such, to get the perceived length.

metalaureate
  • 7,572
  • 9
  • 54
  • 93
  • 1
    Linking [Length of string with emojis](https://stackoverflow.com/q/68063800/4299358): the count of 2 is correct - the error is you wanting a count of **combined** characters, just like `a` + `¨` (those are two characters which can be combined to `ö`). – AmigoJack Oct 16 '21 at 06:45
  • 1
    @metalaureate Based on your examples, it looks like it is counting 16bit WORDS vs 8bit bytes. – Andrew Oct 16 '21 at 12:21
  • 1
    @AmigoJack "single-character emojis, that need the U+FE0F variation selector added, e.g. Red Heart.", from your "Length of string with emojis" link, stipulates the heart is one char. I suspect a proper description would be ~"this heart char is composed of two code points". So you are sort of right (I think)-- select length counts the # of code points. Does that make sense? THANKS! :-) – Andrew Oct 16 '21 at 12:31
  • Yes. Thank you. I understand why it happens. It’s formally correct but wondered if anyone had written a function to count whole multi code point emojis. I’m guessing the answer is no and that I need to preprocess my data against the appropriate Unicode code pages. – metalaureate Oct 16 '21 at 13:46
  • 1
    Related: [user-perceived characters](https://unicode.org/reports/tr29/) and/or https://emojipedia.org/emoji-sequence/ – JosefZ Oct 16 '21 at 14:17
  • 1
    @Andrew `1F3C3` is at least 17bit long. No, it's just coincidence all other codepoints are between U+200D and U+FE0F - PostgreSQL still most likely stores that in UTF-8 and character-wise can handle up to 32bit per codepoint. – AmigoJack Oct 16 '21 at 14:33
  • What do you mean by "true characters"? Extended grapheme clusters? – nwellnhof Oct 17 '21 at 14:43
  • `Extended grapheme clusters` – metalaureate Oct 17 '21 at 16:13

1 Answers1

0

So one way to do this is to ignore all characters in the Variation Selector and decrement by 2 if you hit the General Punctuation range.

This could be converted into a postgres function.

python

"""
# For reference, these code pages apply to emojis
Name    Range
Emoticons    1F600-1F64F
Supplemental_Symbols_and_Pictographs     1F900-1F9FF
Miscellaneous Symbols and Pictographs    1F300-1F5FF
General Punctuation  2000-206F
Miscellaneous Symbols    2600-26FF
Variation Selectors  FE00-FE0F
Dingbats     2700-27BF
Transport and Map Symbols    1F680-1F6FF
Enclosed Alphanumeric Supplement     1F100-1F1FF
"""
emojis="‍♂️‍♂️‍♂️‍♂️‍♂️‍♂️‍♂️" # true count is 7, postgres length() returns 28
true_count=0
for char in emojis:
    d=ord(char)
    char_type=None 
    if (d>=0x2000 and d<=0x206F) : char_type="GP" # Zero Width Joiner
    elif (d>=0xFE00 and d<=0xFE0F) : char_type="VS" # Variation Selector
    print(d, char_type)
    if ( char_type=="GP") : true_count-=2
    elif (char_type!="VS" ):  true_count+=1
print(true_count)
metalaureate
  • 7,572
  • 9
  • 54
  • 93
  • 1
    Use `elsif` or use a [`case` construct](https://www.postgresql.org/docs/9.1/plpgsql-control-structures.html#AEN54561): if the condition `(d>=0x2000 and d<=0x206F)` is TRUE then it makes no sense to test for a different range again. But right now you do that always. – AmigoJack Oct 17 '21 at 08:45