4

Is there any way to pattern match with Unicode graphemes?

As a quick example, when I run this query:

CREATE TABLE test (
    id SERIAL NOT NULL, 
    name VARCHAR NOT NULL,
    PRIMARY KEY (id), 
    UNIQUE (name)
);
INSERT INTO test (name) VALUES (' One');
INSERT INTO test (name) VALUES (' Two');

SELECT * FROM public.test WHERE test.name LIKE '%';

I get both rows returned, rather than just ' Two'. Postgres seems to be just comparing code points, but I want it to compare full graphemes, so it should only match ' Two', because is a different grapheme.

Is this possible?

TechnoSam
  • 578
  • 1
  • 8
  • 23

1 Answers1

3

It's a very interesting question!

I am not quite sure if it is possible anyway:

The skinned emojis are, in fact, two joined characters (like ligatures). The first character is the yellow hand which is followed by an emoji skin modifier

This is how the light skinned hand is stored internally. So, for me, your result makes sense:

When you query any string, that begins with , it will return:

  1. Two (trivial)
  2. _ One (ignore the underscore, I try to suppress the automated ligature with this)

So, you can see, the light skinned emoji internally also starts with . That's why I believe, that your query doesn't work the way you like.

Workarounds/Solutions:

  1. You can add a space to your query. This ensures, that there's no skin modifier after your character. Naturally, this only works in your case, where all data sets have a space after the hand:

    SELECT * FROM test WHERE name LIKE ' %';
    
  2. You can simply extend the WHERE clause like this:

    SELECT * FROM test 
    WHERE name LIKE '%'
        AND name NOT LIKE '%'
        AND name NOT LIKE '%'
        AND name NOT LIKE '%'
        AND name NOT LIKE '%'
        AND name NOT LIKE '%'
    
  3. You can use regular expression pattern matching to exclude the skins:

    SELECT * FROM test 
    WHERE name  ~ '^[^]*$'
    

see demo:db<>fiddle (note that the fiddle seems not to provide automated ligatures, so both characters are separated displayed there)

S-Man
  • 22,521
  • 7
  • 40
  • 63
  • Interesting question **and** interesting answer! – Laurenz Albe Nov 20 '20 at 15:20
  • Fair enough, thanks for the response! This string is user-supplied, so I expect it could be any number of weird grapheme clusters, like families or other modifiers. I don't think I'll try to handle every case in SQL. If it seems important, I can do some additional filtering on the results before passing them to the client. Or even just do all filtering outside of SQL, this table isn't likely to get too large. – TechnoSam Nov 20 '20 at 17:18