Finding Emojis in Strings

Question

So I'm trying to find and replace emojis in strings. This is my approach with regexp so far.

const replaceEmojis = function (string) {
    String.prototype.regexIndexOf = function (regex, startpos) {
        const indexOf = this.substring(startpos || 0).search(regex);
        return (indexOf >= 0) ? (indexOf + (startpos || 0)) : indexOf;
    }
    // generate regexp
    let regexp;
    try {
        regexp = new RegExp('\\p{Emoji}', "gu");
    } catch (e) {
        //4 firefox <3
        regexp = new RegExp(`(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])`, 'g');
    }

    // get indices of all emojis
    function getIndicesOf(searchStr, str) {
        let index, indices = [];

        function getIndex(startIndex) {
            index = str.regexIndexOf(searchStr, startIndex);
            if (index === -1) return;
            indices.push(index);
            getIndex(index + 1)
        }

        getIndex(0);

        return indices;
    }

    const emojisAt = getIndicesOf(regexp, string);

    // replace emojis with SVGs
    emojisAt.forEach(index => {
        // got nothing here yet
        // const unicode = string.charCodeAt(index); //.toString(16);
    })

    return string;
}

The problem is that I only get an array of the indices where the emojis occur in the string. With the indices alone I can't replace them, because I don't know how many UTF-16 code units each emoji takes up. To replace an emoji I also need to know which emoji it is.

So, is there a way to also get the length of the emoji? Or is there a better (perhaps simpler) way than mine to replace emojis?

What exactly do you consider emoji? There are many thousands of these things. What is your goal? — Brad, Apr 14 '20 at 17:21
@Brad I mean all [Emojis in Unicode](http://unicode.org/emoji/charts/full-emoji-list.html) including skin tones, basically everything you select with the `/\p{Emoji}/gu` expression. I want to replace them with SVGs of emojis. The idea is to locate an emoji, get its Unicode code point (not its char code), and replace it with an SVG. One (long) code point, however, is encoded as multiple UTF-16 code units, so emojis span a varying number of UTF-16 code units, and I need to know how "long" an emoji is in order to replace it. – drinking-code, Apr 14 '20 at 17:56

drinking-code · Answer 1 · 2020-04-15T20:17:47.423

Alright, so turns out I just had a little bit of a mental block.
To find the emojis I don't need the indices, as WolverinDEV mentioned. However, just using string.replace with /\p{Emoji}/gu doesn't work, because it breaks up a ZWJ sequence (e.g. a person emoji joined with ♂️) into separate pieces: the base emoji, the zero-width joiner, and ♂. So I tweaked the regexp to account for that: /[\p{Emoji}\u200d]+/gu. Now the emoji is returned in full because zero-width joiners are included.
This is what I got (if anyone cares):

const replaceEmojis = function (string) {
    // match() returns null when nothing matches, so fall back to an empty array
    const emojis = string.match(/[\p{Emoji}\u200d]+/gu) || [];
    // console.log(emojis);

    // replace emojis with SVGs
    emojis.forEach(emoji => {
        // get the unicodes of the emoji
        let unicode = "";

        function getNextChar(pointer) {
            const subUnicode = emoji.codePointAt(pointer);
            if (!subUnicode) return;
            unicode += '-' + subUnicode.toString(16);
            getNextChar(++pointer);
        }

        getNextChar(0);

        unicode = unicode.slice(1); // remove the leading dash '-'
        console.log(unicode.toUpperCase());

        // replace emoji here
        // string = string.replace(emoji, `<svg src='path/to/svg/${unicode}.svg'>`)
    })

    return string;
}

This still needs work (for example, the output contains low surrogates, because the loop advances one UTF-16 code unit at a time), but fundamentally it works.

EDIT:

First improvement:
You may not need this, but to get rid of low surrogate characters, add a condition in getNextChar():

if (!(subUnicode >= 56320 && subUnicode <= 57343)) unicode += '-' + subUnicode.toString(16);

This only adds the character code if it isn't a low surrogate character.
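As an alternative to filtering out low surrogates, here is a minimal sketch that iterates the string by code point instead of by UTF-16 code unit, so surrogate halves never show up in the first place. The helper name `toCodePointName` is hypothetical and not part of the answer's code.

```javascript
// Hypothetical helper: turns an emoji string into a dash-separated list of
// upper-case hex code points, e.g. for building an SVG file name.
function toCodePointName(emoji) {
  // Spreading a string iterates by code point, not by UTF-16 code unit,
  // so a surrogate pair yields one entry instead of two.
  return [...emoji]
    .map(ch => ch.codePointAt(0).toString(16).toUpperCase())
    .join('-');
}

// U+1F926 + ZWJ + U+2642 + VS16, written with escapes
console.log(toCodePointName('\u{1F926}\u200D\u2642\uFE0F')); // 1F926-200D-2642-FE0F
```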

Second improvement:
Add the variation selector 16 (U+FE0F) to the regexp to select more emojis en bloc:

/[\p{Emoji}\u200d\ufe0f]+/gu
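To see what each of the three expressions captures, here is a small comparison on a single ZWJ sequence. The sample string and variable names are my own choices for illustration. Note that the `length` of each match is the number of UTF-16 code units it occupies, which is the "length" asked about in the question.

```javascript
// Sample (an assumption for illustration):
// U+1F926 (person facepalming) + ZWJ + U+2642 (male sign) + VS16
const sample = '\u{1F926}\u200D\u2642\uFE0F';

const single = sample.match(/\p{Emoji}/gu);                  // one code point per match
const withZwj = sample.match(/[\p{Emoji}\u200d]+/gu);        // joined, but stops before U+FE0F
const withVs16 = sample.match(/[\p{Emoji}\u200d\ufe0f]+/gu); // the full sequence

console.log(single.length);      // 2 (the sequence is split in two)
console.log(withZwj[0].length);  // 4 UTF-16 code units
console.log(withVs16[0].length); // 5 UTF-16 code units
```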

To avoid matching against numbers in the string, use `\p{Extended_Pictographic}` instead (from https://stackoverflow.com/questions/18862256/how-to-detect-emoji-using-javascript). Also a list of test cases for ZWJ emoji sequences: https://unicode.org/emoji/charts/emoji-zwj-sequences.html. — Sentient, Feb 20 '21 at 00:04
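A quick sketch of the difference the comment above describes; the sample text is made up for illustration.

```javascript
const text = 'Room 42 \u{1F600}'; // U+1F600 is a grinning face

const byEmoji = text.match(/\p{Emoji}/gu);                        // '4', '2', and the face
const byPictographic = text.match(/\p{Extended_Pictographic}/gu); // only the face

console.log(byEmoji.length);        // 3
console.log(byPictographic.length); // 1
```

As far as I know, `\p{Extended_Pictographic}` also leaves out skin-tone modifiers and regional indicators, so those would still have to be added to the expression separately.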

score 0 · Answer 2 · answered Apr 14 '20 at 22:50

You already have a working RegExp, so you could use String.replace:

string.replace(regexp, emoji => {
    return "<an emoji was here>";
});

So you don't need to find any indices at all.
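If the position and length are still wanted, the replace callback already receives them. A minimal sketch, assuming a regexp without capture groups (so the second callback argument is the offset); the sample input and the `found` array are illustrative only.

```javascript
const input = 'ok \u{1F44D} done'; // U+1F44D is a thumbs-up
const found = [];

const output = input.replace(/\p{Emoji}/gu, (emoji, offset) => {
  // emoji.length is the number of UTF-16 code units the match occupies
  found.push({ emoji, offset, length: emoji.length });
  return '<an emoji was here>';
});

console.log(output); // ok <an emoji was here> done
console.log(found);  // one entry: offset 3, length 2
```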

WolverinDEV


Yes! Thank you for your answer! I re-thought the whole thing and also saw that `string.match(regexp)` is an option to just get the emojis and then replace them inside the `forEach` with `string.replace`. Although your solution is much simpler, I'll do it separately because unfortunately I have to account for all the identifiers (skin tone, gender, connectors, variation selectors, etc.). – drinking-code Apr 14 '20 at 23:13

score 0 · Answer 3 · answered Apr 21 '23 at 03:55

First of all: \p{Emoji} is not what you need.

Which Length-One Characters Match Against `\p{Emoji}`?

I'm assuming we are working within the Basic Multilingual Plane (the first Unicode plane), which includes all the characters we "commonly" use. That's 65,536 code points, so let's use JavaScript to list the ones that match \p{Emoji}:

console.log(...(new Array(2 ** 16)).fill(null).reduce((characters, _, i) => characters.concat(String.fromCodePoint(i)), '').match(/\p{Emoji}/gu));

Luckily, we can easily extract the characters we are interested in (#*0123456789) from the above results.
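One way to do that extraction, as a small sketch: restrict the scan to ASCII and keep whatever still matches. The variable name `asciiEmoji` is mine.

```javascript
// Collect the ASCII characters (U+0000 to U+007F) that carry the Emoji property.
let asciiEmoji = '';
for (let i = 0; i < 128; i++) {
  const ch = String.fromCodePoint(i);
  if (/\p{Emoji}/u.test(ch)) asciiEmoji += ch;
}

console.log(asciiEmoji); // #*0123456789
```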

How to Match Emojis Properly

Actually, the Unicode property Emoji is not intended for this job, as described here: Unicode® Standard Annex #44 - UNICODE CHARACTER DATABASE - Property Definitions (Emoji Data). Yes, it does match emojis, but only one code point at a time, whereas we also want several code points combined into a single emoji to match as one. This is a job for a different regex, the one described here at Unicode® Technical Standard #51 - UNICODE EMOJI - EBNF and Regex.

Based on it we can build this ugly but effective emoji regex:

const emojiRegex = /\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?(\u200D(\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?))*/gu;
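To make that expression easier to read, here is a sketch that assembles an equivalent pattern from named parts. The names (`flag`, `element`, `unit`, `emojiRegexFromParts`) are illustrative, and I swapped the capturing groups for non-capturing ones, which should not change what is matched.

```javascript
// Two regional indicators form a flag.
const flag = String.raw`\p{RI}\p{RI}`;

// One emoji code point, optionally followed by ONE of:
//   \p{EMod}                        a skin-tone modifier
//   \uFE0F\u20E3?                   VS16, optionally a combining keycap
//   [\u{E0020}-\u{E007E}]+\u{E007F} a tag sequence ending in the cancel tag
const element = String.raw`\p{Emoji}(?:\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?`;

// Anything that may follow a zero-width joiner (U+200D).
const unit = `(?:${flag}|${element})`;

const emojiRegexFromParts = new RegExp(
  String.raw`${flag}|${element}(?:\u200D${unit})*`,
  'gu'
);

// Thumbs-up with skin tone, a three-person ZWJ family, and a flag:
const parts = ('a\u{1F44D}\u{1F3FD}b' +
  '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}' +
  '\u{1F1E9}\u{1F1EA}').match(emojiRegexFromParts);

console.log(parts.length); // 3
```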

Answer

Putting it all together:

const emojiBlackList = '#*0123456789';
const emojiRegex = /\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?(\u200D(\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?))*/gu;

function replaceAllEmojis(string) {
  // Keep plain '#', '*' and digits: they match \p{Emoji} but are not emojis on their own.
  const emojis = (string.match(emojiRegex) || []).filter(emoji => !emojiBlackList.includes(emoji));

  if (emojis.length === 0) {
    return string; // Nothing to do here.
  }

  let noEmojis = string;

  for (const emoji of emojis) {
    noEmojis = noEmojis.replace(emoji, '');
  }

  return noEmojis;
}

// Mixed:
console.log(replaceAllEmojis('☺123 !@#$%^asd⚕♻⚜☑✔❌〽✳©®™#️⃣*️⃣0️⃣1️⃣2️⃣‍♂️‍♂️‍♂️‍❤️‍‍‍❤️‍‍'));

// Non-emojis only:
console.log(replaceAllEmojis('#*0123456789'));

// Emojis only:
console.log(replaceAllEmojis('✔❌‍❤️‍‍'));

This implementation is just a demo; tweak/improve it as you need.
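One possible tweak, sketched under a few assumptions: do the replacement in a single pass with a callback, and build the replacement from the emoji's code points, which is what the question was after. The names `keep`, `emojiPattern`, `replaceEmojisWith` and `toSvgTag` are hypothetical, the `<img>` markup is only a placeholder, and the `path/to/svg/` location is taken from the commented-out line in the first answer.

```javascript
const keep = '#*0123456789'; // match \p{Emoji} but should stay as plain text
const emojiPattern = /\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?(\u200D(\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?))*/gu;

// Replace every emoji sequence with whatever toReplacement returns for it.
function replaceEmojisWith(string, toReplacement) {
  return string.replace(emojiPattern, match =>
    keep.includes(match) ? match : toReplacement(match));
}

// Placeholder replacement: file name built from the hex code points.
const toSvgTag = emoji => {
  const name = [...emoji]
    .map(ch => ch.codePointAt(0).toString(16).toUpperCase())
    .join('-');
  return `<img src="path/to/svg/${name}.svg">`;
};

// Thumbs-up (U+1F44D) with a skin-tone modifier (U+1F3FD)
const replaced = replaceEmojisWith('1 + 1 = \u{1F44D}\u{1F3FD}', toSvgTag);
console.log(replaced); // 1 + 1 = <img src="path/to/svg/1F44D-1F3FD.svg">
```

Because every match is handled where it occurs, this avoids the second loop of string.replace calls, each of which only replaces the first occurrence of the given emoji.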
