6

I have a 1B row table of chat messages with a max-sized VARCHAR column for the raw message text. How do I -

  1. Select just the rows that contain 1 or more emoji characters
  2. Efficiently filter out rows without emoji (if needed to improve performance)
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
TJ Murphy
  • 156
  • 4
  • Welcome to Stack Overflow. Please take the 2-minute [tour]. Moreover, open [Help] and read at least [ask]. Then, [edit] your question to provide a [mcve]. Questions that only ask for code are too broad and are likely to be [put on hold or closed](https://stackoverflow.com/help/closed-questions). – JosefZ Nov 11 '21 at 16:51

2 Answers2

5

Combining the knowledge shared by Lukasz and Greg, plus discovering an additional hidden character, I can present this solution:

  • 1 JS UDF that detects the presence of emojis.
  • 1 JS UDF that detects if the string is only emojis.
create or replace function has_emoji(X text)
returns boolean
language javascript
as $$
return /\p{Extended_Pictographic}/u.test(X);
$$;

create or replace function only_emoji(X text)
returns boolean
language javascript
as $$
return /^[\p{Extended_Pictographic}\p{Emoji_Component}]+$/u.test(X);
$$;

Sample use:

with data as (
  select $1 t
  from values('❄️'),('❄️ is the data ☁️'),('no emoji')
)

select *, has_emoji(t), only_emoji(t)
from data
;

enter image description here

--

Some research:

  • '❄️'.replace(/\p{Extended_Pictographic}/ug, '') returns ''
  • but '❄️'.replace(/\p{Extended_Pictographic}/ug, '') is not equal to ''
  • Turns out there are hidden characters that still need to be replaced, and matched by Emoji_Component

enter image description here

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
0

Using Snowflake's JavaScript User Defined Function:

CREATE OR REPLACE FUNCTION EMOJI_TEST (STR VARCHAR)
  RETURNS VARCHAR
  LANGUAGE JAVASCRIPT
  AS $$
  return (/\p{Emoji}/u.test(STR));
  $$
  ;

Query:

WITH cte(col) AS (
    SELECT ''   UNION ALL
    SELECT 'text' UNION ALL
    SELECT 'text '
)
SELECT col, EMOJI_TEST(col)
FROM cte
-- WHERE EMOJI_TEST(col)::BOOLEAN

Output:

enter image description here

Related: How to detect emoji using javascript

Lukasz Szozda
  • 162,964
  • 23
  • 234
  • 275
  • That works, but I'm having trouble detecting only emoji. If `/\p{Emoji}/u.test('❄️')` testing for emoji works, why not `/^\p{Emoji}$/u.test('❄️')` for only emoji? – Felipe Hoffa Nov 11 '21 at 17:38
  • @FelipeHoffa It is strange ,the beginning of string works `/^\p{Emoji}/u.test('❄️')` so it is somehow connected with end of string and `$`(though that works for exact match `/^❄️$/u.test('❄️')` – Lukasz Szozda Nov 11 '21 at 18:24
  • Indeed... I had an answer similar to this one, but then got stuck with the "only emoji" requirement – Felipe Hoffa Nov 11 '21 at 18:48
  • @FelipeHoffa Maybe you could rewrite it as Java UDF :) – Lukasz Szozda Nov 11 '21 at 18:48
  • JS makes it so easy with `\p{Emoji}` tho.. – Felipe Hoffa Nov 11 '21 at 19:01
  • 1
    Although the OP is asking about "emoji" the specification has a *lot* of things in it that most people don't think of as emojis: https://unicode.org/Public/emoji/11.0/emoji-data.txt. For example using `{Emoji}` it will return true for #, the expression "Flowers 123" etc. It would be more accurate to change the RegEx code to `{Extended_Pictographic}` – Greg Pavlik Nov 11 '21 at 21:54
  • 1
    Extended_Pictographic + Emoji_Component finally solved the mystery... https://stackoverflow.com/a/69936510/132438 – Felipe Hoffa Nov 12 '21 at 00:19