30

I want to replace every emoji in a string with an icon. I have already replaced emoticons such as :) :D :P :3 <3 XP with icons, so if the user types :) in a string, it is replaced with an icon.

But I have a problem: what if the user directly pastes the Unicode emoji that corresponds to :)?

What I need: a way to convert a Unicode emoji into its JavaScript escape form, something like the surrogate range \ud800-\udbff. I have many emoji, so I need a general way to convert them, and after converting them I want to match them with regular expressions.

Example: wew
Change those emoji to \uD83D\uDE01|\uD83D\uDE4F|. I don't know how to change them, so I need to know how to change any emoji to those characters.

Nat Riddle
Mohamed Mohamed

9 Answers

35

In ECMAScript 6 you should be able to detect emoji in a fairly simple way. I have compiled a simple regex comprising several Unicode blocks:

Regex:

/[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]/ug

Playground: play around with emoji and regex

This answer doesn't directly answer the question but gives a fair insight on how to handle emoji using Unicode blocks and ES6.
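
As a minimal sketch of how such a range-based regex might be used for replacement, the snippet below matches only the Emoticons block from the regex above and swaps each match for a bracketed code point label. The function name `labelEmoji` and the label format are illustrative assumptions, not part of the answer; a real replacement would map the code point to an icon instead.

```javascript
// Hypothetical helper: replaces each emoji in the Emoticons block
// (U+1F600-U+1F64F) with a text label such as [U+1F601].
function labelEmoji(text) {
  // The `u` flag is required for \u{...} escapes and makes the class match
  // whole code points instead of individual surrogate halves.
  const emoticons = /[\u{1f600}-\u{1f64f}]/gu;
  return text.replace(emoticons, (match) =>
    "[U+" + match.codePointAt(0).toString(16).toUpperCase() + "]"
  );
}

console.log(labelEmoji("Hi \u{1F601} there \u{1F64F}"));
// prints "Hi [U+1F601] there [U+1F64F]"
```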

Suhail Gupta
  • 1
    In case you would like also to match compound emojis (different skin, sex, theme) you should add this `\u{200d}` and instead of matching one by one, you would have to match {1,4} – Leonardo Emilio Dominguez Apr 15 '19 at 19:08
  • @LeonardoEmilioDominguez can you show the full regex please? – arkhamvm Apr 03 '20 at 08:55
  • 2
    @arkhamvm Sure! ```/[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}\u{200d}]*/ug``` – Leonardo Emilio Dominguez Apr 03 '20 at 14:18
  • 1
    I'm late to the party here, but I cannot for the life of me figure out how to modify this regex to _exclude_ unicode characters. For example, if I wanted to use this to match every character that's NOT in the unicode range, how might I do that? – Chris Ferdinandi Feb 03 '21 at 15:58
13

Use unicode property escapes like this:

/\p{Emoji_Presentation}/ug
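
A small sketch of how this property escape behaves compared with the broader `\p{Emoji}` property. The sample strings are assumptions chosen for illustration; property escapes need the `u` flag and an engine that supports them.

```javascript
const emojiPresentation = /\p{Emoji_Presentation}/gu;
const anyEmoji = /\p{Emoji}/u;

// U+1F600 GRINNING FACE is displayed as an emoji by default, so it matches.
const found = "Hello \u{1F600} world".match(emojiPresentation);
console.log(found.length); // prints 1

// ASCII digits have the Emoji property (they are keycap bases)
// but not Emoji_Presentation.
console.log(anyEmoji.test("7")); // prints true
console.log(/\p{Emoji_Presentation}/u.test("7")); // prints false
```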
RonyHe
8

I think you could also use Unicode character properties. The Unicode Consortium itself provides a regex, which can be adjusted for ECMAScript relatively easily (by replacing all occurrences of \x with \u and putting it all on one line). It selects possible emoji, though, meaning it will yield false positives. It is explicitly advised to validate all matches before assuming they are in fact emoji.

Here's a somewhat stricter version of that regex which will return fewer false positives, with a mini demo:

const sentence = 'A ticket to 大阪 costs ¥2000 . Repeated emojis: . Crying cat: . Repeated emoji with skin tones: ✊✊✊✊✊✊. Flags: . Scales ⚖️⚖️⚖️.';

const regexpUnicodeModified = /\p{RI}\p{RI}|\p{Emoji}(\p{EMod}+|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?(\u{200D}\p{Emoji}(\p{EMod}+|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?)+|\p{EPres}(\p{EMod}+|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?|\p{Emoji}(\p{EMod}+|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})/gu
console.log(sentence.match(regexpUnicodeModified));

This will log the following:

> Array ["", "", "", "", "✊", "✊", "✊", "✊", "✊", "✊", "", "", "⚖️", "⚖️", "⚖️"]

which means it matches:

  • simple emoji
  • emoji with modifiers (skin tones)
  • country flags
  • region flags
  • emoji presentation sequences

Note that I don't see how this could be used for replacing specific emoji with images, as the OP wanted, but it does make it possible to place the emoji inside extra tags and such.
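
One way to read "place the emoji inside extra tags" is sketched below. It uses a much simpler pattern than the one above (flags, plus default-emoji characters with an optional skin-tone modifier) so the example stays short; the function name `wrapEmoji` and the `emoji` class name are assumptions made for illustration.

```javascript
// Simplified pattern: a pair of regional indicators (a flag), or a character
// with default emoji presentation optionally followed by a skin-tone modifier.
const simpleEmoji =
  /\p{Regional_Indicator}{2}|\p{Emoji_Presentation}\p{Emoji_Modifier}?/gu;

// Hypothetical helper: wraps every match in a <span> for styling.
function wrapEmoji(text) {
  return text.replace(simpleEmoji, (m) => '<span class="emoji">' + m + "</span>");
}

// U+1F1EF U+1F1F5 is the flag of Japan;
// U+270A U+1F3FD is a raised fist with a skin tone.
const wrapped = wrapEmoji("Go \u{1F1EF}\u{1F1F5} \u{270A}\u{1F3FD}!");
console.log(wrapped);
```

The flag alternative comes first on purpose: regional indicators also have default emoji presentation, so the second alternative alone would split a flag into two matches.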

Rimas Kudelis
7

Note - The regex below matches surrogate pairs (supplementary-plane characters) as well as single code units (Basic Multilingual Plane).

To see the hex version of what matched:
If the length of the match is 2, character 1 is a high surrogate and character 2 is a low surrogate. Just format each character as hex and join them in a string.
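
The hex formatting described above could be sketched as follows. The helper names `toUtf16Escapes` and `toAlternation` are assumptions made for this example; the idea is simply to emit one \uXXXX escape per UTF-16 code unit and join the results with `|`.

```javascript
// Hypothetical helper: turns every UTF-16 code unit of `str` into a \uXXXX escape.
function toUtf16Escapes(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    out += "\\u" + str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
  }
  return out;
}

// Build an alternation such as \uD83D\uDE01|\uD83D\uDE4F from a list of emoji.
function toAlternation(emojiList) {
  return emojiList.map(toUtf16Escapes).join("|");
}

const source = toAlternation(["\u{1F601}", "\u{1F64F}"]);
console.log(source); // prints \uD83D\uDE01|\uD83D\uDE4F

// The escaped source can be compiled directly into a regex.
const escapedRegex = new RegExp(source, "g");
console.log("ok \u{1F64F}".replace(escapedRegex, "[icon]")); // prints "ok [icon]"
```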

You could try to match some emoji via hex ranges.

This regex matches these 1,114 emoji characters.

Note - this excludes characters in the range \x00-\x7f. Some ASCII characters, such as the digits 0-9, carry the Emoji property (\p{Emoji=yes}) because they are the base of keycap sequences.

The below regex will match these emoji.

©®‼⁉™ℹ↔↕↖↗↘↙↩↪⌚⌛⌨⏏⏩⏪⏫⏬⏭⏮⏯⏰⏱⏲⏳⏸⏹⏺Ⓜ▪▫▶◀◻◼◽◾☀☁☂☃☄☎☑☔☕☘☝☠☢☣☦☪☮☯☸
☹☺♀♂♈♉♊♋♌♍♎♏♐♑♒♓♠♣♥♦♨♻♿⚒⚓⚔⚕⚖⚗⚙⚛⚜⚠⚡⚪⚫⚰⚱⚽⚾⛄⛅⛈⛎⛏⛑⛓⛔⛩⛪⛰⛱⛲⛳⛴⛵⛷⛸⛹⛺
⛽✂✅✈✉✊✋✌✍✏✒✔✖✝✡✨✳✴❄❇❌❎❓❔❕❗❣❤➕➖➗➡➰➿⤴⤵⬅⬆⬇⬛⬜⭐⭕〰〽㊗㊙

Regex

(?:[\u00A9\u00AE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9-\u21AA\u231A-\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA-\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614-\u2615\u2618\u261D\u2620\u2622-\u2623\u2626\u262A\u262E-\u262F\u2638-\u263A\u2640\u2642\u2648-\u2653\u2660\u2663\u2665-\u2666\u2668\u267B\u267F\u2692-\u2697\u2699\u269B-\u269C\u26A0-\u26A1\u26AA-\u26AB\u26B0-\u26B1\u26BD-\u26BE\u26C4-\u26C5\u26C8\u26CE-\u26CF\u26D1\u26D3-\u26D4\u26E9-\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733-\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763-\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934-\u2935\u2B05-\u2B07\u2B1B-\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299]|(?:\uD83C[\uDC04\uDCCF\uDD70-\uDD71\uDD7E-\uDD7F\uDD8E\uDD91-\uDD9A\uDDE6-\uDDFF\uDE01-\uDE02\uDE1A\uDE2F\uDE32-\uDE3A\uDE50-\uDE51\uDF00-\uDF21\uDF24-\uDF93\uDF96-\uDF97\uDF99-\uDF9B\uDF9E-\uDFF0\uDFF3-\uDFF5\uDFF7-\uDFFF]|\uD83D[\uDC00-\uDCFD\uDCFF-\uDD3D\uDD49-\uDD4E\uDD50-\uDD67\uDD6F-\uDD70\uDD73-\uDD7A\uDD87\uDD8A-\uDD8D\uDD90\uDD95-\uDD96\uDDA4-\uDDA5\uDDA8\uDDB1-\uDDB2\uDDBC\uDDC2-\uDDC4\uDDD1-\uDDD3\uDDDC-\uDDDE\uDDE1\uDDE3\uDDE8\uDDEF\uDDF3\uDDFA-\uDE4F\uDE80-\uDEC5\uDECB-\uDED2\uDEE0-\uDEE5\uDEE9\uDEEB-\uDEEC\uDEF0\uDEF3-\uDEF6]|\uD83E[\uDD10-\uDD1E\uDD20-\uDD27\uDD30\uDD33-\uDD3A\uDD3C-\uDD3E\uDD40-\uDD45\uDD47-\uDD4B\uDD50-\uDD5E\uDD80-\uDD91\uDDC0]))  

Expanded

 (?:
      [\u00A9\u00AE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9-\u21AA\u231A-\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA-\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614-\u2615\u2618\u261D\u2620\u2622-\u2623\u2626\u262A\u262E-\u262F\u2638-\u263A\u2640\u2642\u2648-\u2653\u2660\u2663\u2665-\u2666\u2668\u267B\u267F\u2692-\u2697\u2699\u269B-\u269C\u26A0-\u26A1\u26AA-\u26AB\u26B0-\u26B1\u26BD-\u26BE\u26C4-\u26C5\u26C8\u26CE-\u26CF\u26D1\u26D3-\u26D4\u26E9-\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733-\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763-\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934-\u2935\u2B05-\u2B07\u2B1B-\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299] 
   |  
      (?:
           \uD83C [\uDC04\uDCCF\uDD70-\uDD71\uDD7E-\uDD7F\uDD8E\uDD91-\uDD9A\uDDE6-\uDDFF\uDE01-\uDE02\uDE1A\uDE2F\uDE32-\uDE3A\uDE50-\uDE51\uDF00-\uDF21\uDF24-\uDF93\uDF96-\uDF97\uDF99-\uDF9B\uDF9E-\uDFF0\uDFF3-\uDFF5\uDFF7-\uDFFF] 
        |  \uD83D [\uDC00-\uDCFD\uDCFF-\uDD3D\uDD49-\uDD4E\uDD50-\uDD67\uDD6F-\uDD70\uDD73-\uDD7A\uDD87\uDD8A-\uDD8D\uDD90\uDD95-\uDD96\uDDA4-\uDDA5\uDDA8\uDDB1-\uDDB2\uDDBC\uDDC2-\uDDC4\uDDD1-\uDDD3\uDDDC-\uDDDE\uDDE1\uDDE3\uDDE8\uDDEF\uDDF3\uDDFA-\uDE4F\uDE80-\uDEC5\uDECB-\uDED2\uDEE0-\uDEE5\uDEE9\uDEEB-\uDEEC\uDEF0\uDEF3-\uDEF6] 
        |  \uD83E [\uDD10-\uDD1E\uDD20-\uDD27\uDD30\uDD33-\uDD3A\uDD3C-\uDD3E\uDD40-\uDD45\uDD47-\uDD4B\uDD50-\uDD5E\uDD80-\uDD91\uDDC0] 
      )
 )
tripleee
3

You can convert emoji to \u escape sequences with the function below.

var emojiToUnicode = function (message) {
    var emojiRegexp = /([\uE000-\uF8FF]|\uD83C[\uDC00-\uDFFF]|\uD83D[\uDC00-\uDFFF]|[\u2694-\u2697]|\uD83E[\uDD10-\uDD5D])/g;
    if (!message)
        return;
    // A match is either a single BMP character or a surrogate pair, so emit
    // one \uXXXX escape per UTF-16 code unit instead of always assuming two.
    return message.replace(emojiRegexp, function (emoji) {
        var escaped = "";
        for (var i = 0; i < emoji.length; i++) {
            escaped += "\\u" + emoji.charCodeAt(i).toString(16).padStart(4, "0");
        }
        return escaped;
    });
 };
satya test
2

A lot of the suggested patterns do not match Modifier Sequence emojis (skin tones) or compound emojis correctly, or are outdated and don't match newer emojis, or both.

Consider this doozy of an emoji and the regular expression that would match it:

console.log("‍❤️‍‍".split('').map(function(chr) { return '\\u' + chr.charCodeAt(0).toString(16); }).join(''))

That's quite the pattern. It's because it's a bunch of other emojis joined with the U+200D ZERO WIDTH JOINER:

+ U+200D + ❤️‍ + U+200D + ‍ + U+200D +

So, you want your pattern to match the longer sequences first or you'll match those "inner emojis" erroneously.
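
A minimal sketch of the longest-first idea, assuming a tiny hand-picked emoji list. `escapeRegex` and `buildEmojiRegex` are hypothetical helper names; a real list would come from the Unicode emoji data files.

```javascript
// Escape regex syntax characters (keycap emoji start with "#", "*" or a digit).
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: builds one alternation, longest sequence first.
function buildEmojiRegex(emojiList) {
  const sorted = [...emojiList].sort((a, b) => b.length - a.length);
  return new RegExp(sorted.map(escapeRegex).join("|"), "gu");
}

const man = "\u{1F468}";
const woman = "\u{1F469}";
const girl = "\u{1F467}";
const family = man + "\u200D" + woman + "\u200D" + girl; // ZWJ sequence

const longestFirst = buildEmojiRegex([man, woman, girl, family]);
console.log(family.match(longestFirst).length); // prints 1

// Same list without sorting: the single emoji win and the family is split up.
const unsorted = new RegExp([man, woman, girl, family].join("|"), "gu");
console.log(family.match(unsorted).length); // prints 3
```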

Solution? Use a pattern like this, which, while long, is drop dead simple because it's a single alternation (?:longest|secondLongest|....|secondShortest|shortest): https://github.com/sweaver2112/Regex-combined-emojis/blob/master/regex.js

Here's a working example:

/*compile the pattern string into a regex*/
let emoRegex = new RegExp(emojiPattern, "g")

/*extracting the emojis*/
let emojis = [..."This ‍⚖️is the ‍♀️text.".matchAll(emoRegex)];
console.log(emojis)

/*count of emojis*/
let emoCount = [..."This ‍⚖️is the ‍♀️text.".matchAll(emoRegex)].length
console.log(emoCount)

/*strip emojis from text*/
let stripped = "This ‍⚖️is the ‍♀️text.".replaceAll(emoRegex, "")
console.log(stripped)

/*use the pattern string to build a custom regex*/
let customRegex = new RegExp(".*"+emojiPattern+"{3}$") //match a string ending in 3 emojis
console.log(customRegex.test("yep three here ‍⚖️"))
console.log(customRegex.test("nope "))
<script src="https://gitcdn.link/repo/sweaver2112/Regex-combined-emojis/master/regex.js"></script>

Regex 101 Demo matches all 3521 Emojis as of May 2021

The demo includes all characters from https://unicode.org/emoji/charts/full-emoji-list.html and https://unicode.org/emoji/charts-13.1/full-emoji-modifiers.html.

Scott Weaver
2

Here is what I am using:

var regexp = /((\ud83c[\udde6-\uddff]){2}|([\#\*0-9]\u20e3)|(\u00a9|\u00ae|[\u2000-\u3300]|[\ud83c-\ud83e][\ud000-\udfff])((\ud83c[\udffb-\udfff])?(\ud83e[\uddb0-\uddb3])?(\ufe0f?\u200d([\u2000-\u3300]|[\ud83c-\ud83e][\ud000-\udfff])\ufe0f?)?)*)/g

It is very short compared to many other solutions and covers pretty much everything: flags, surrogate pairs, and combinations with gender, skin tone, or other emoji.

A downside is that it matches more than just the well-known emoji (though this can also be seen as a benefit, because there is a good chance it will cover newly released emoji as well).

Here is how to use it to replace Unicode emoji with an img tag:

function emojiToHex(charSet) {
    var comb = [];
    for (var e1, e2, i = 0; i < charSet.length; i += 1) {
        e1 = charSet.charCodeAt(i);

        // High (leading) surrogate: combine it with the low surrogate that follows
        if (e1 >= 0xD800 && e1 <= 0xDBFF) {
            e2 = charSet.charCodeAt(i + 1);
            i++;
            comb.push((
                (e1 - 0xD800) * 0x400
                + (e2 - 0xDC00) + 0x10000
            ).toString(16));
        } else {
            comb.push(e1.toString(16));
        }

    }

    return comb.join('-');
}

function getEmojiImage(charSet) {
    return '<img alt="' + charSet + '" src="https://your.cdn/' + emojiToHex(charSet) + '.png" />';
}

container.innerHTML = text
            .replace(regexp, getEmojiImage);
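
The surrogate arithmetic inside emojiToHex can be checked in isolation. The sketch below uses a hypothetical helper, `decodeSurrogatePair`, and compares its result with the built-in `codePointAt`.

```javascript
// Combine a high and a low surrogate into one code point (standard UTF-16 decoding).
function decodeSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// U+1F601 is stored as the pair D83D DE01:
// (0x3D * 0x400) + 0x201 + 0x10000 = 0xF400 + 0x201 + 0x10000 = 0x1F601
const codePoint = decodeSurrogatePair(0xD83D, 0xDE01);
console.log(codePoint.toString(16)); // prints 1f601

// The built-in codePointAt performs the same decoding.
console.log("\uD83D\uDE01".codePointAt(0) === codePoint); // prints true
```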
Antal Alin
2

This question really helped me when replacing emoji by Noto Emoji images. I didn't want to include a big library for what is basically this:

function emojiToFilename(emoji) {
  return [...emoji].map(char => char.codePointAt(0).toString(16).padStart(4, '0')).join('_').replace(/_fe0f/g, '');
}

function emojis2images(dom) {
  const regexpUnicodeModified = /\uD83C\uDFF4(\uDB40[\uDC61-\uDC7A])+\uDB40\uDC7F|(\ud83c[\udde6-\uddff]){2}|([\#\*0-9]\ufe0f?\u20e3)|(\u00a9|\u00ae|[\u203c-\u3300]|[\ud83c-\ud83e][\ud000-\udfff])((\ud83c[\udffb-\udfff])?(\ud83e[\uddb0-\uddb3])?(\ufe0f?\u200d([\u2000-\u3300]|[\ud83c-\ud83e][\ud000-\udfff])\ufe0f?)?)*/g;
  dom.innerHTML = dom.innerHTML.replace(regexpUnicodeModified, function(m, g1, g2) {
    if(g1 || g2)
      return `<img src="https://raw.githubusercontent.com/googlefonts/noto-emoji/main/third_party/region-flags/waved-svg/emoji_u${emojiToFilename(m)}.svg" alt="${m}">`;
    else
      return `<img src="https://raw.githubusercontent.com/googlefonts/noto-emoji/main/svg/emoji_u${emojiToFilename(m)}.svg" alt="${m}">`;
  });
}

window.onload = function() {
  emojis2images(document.getElementById('emojis'));
}
p {
  font-size: 16px;
  white-space: pre;
}

img {
  width: 1.2em;
  vertical-align: bottom;
}
<p id="emojis">
  skin tones: ✊‍
  Flags: 
  Scales ⚖️⚖️⚖️
  Keycaps: 1️⃣1⃣

  Emoji V14: ‍
  Emoji V15: ‍⬛
  Emoji V15.1: ‍‍‍
</p>

This is basically using https://stackoverflow.com/a/69866962/17169707 but I inserted \uD83C\uDFF4(\uDB40[\uDC61-\uDC7A])+\uDB40\uDC7F to match subdivision-flags and changed \u2000-\u3300 to \u203c-\u3300 because it was also matching something like or .

I tried to use \p{Emoji}, but as far as I can tell these classes do not work if the operating system or browser does not know about the Unicode characters. In my case, my system doesn't support Emoji V15 yet, so it wouldn't match those emoji. IMO, this kind of defeats the purpose, because I'm replacing the emoji with images precisely because they're not yet supported on every platform.
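
A quick way to probe this on a given platform is sketched below. `engineKnows` is a hypothetical helper, and the result depends on the Unicode data shipped with the JavaScript engine, so no fixed output is shown. As far as I know, `Extended_Pictographic` also covers code points reserved for future emoji, so it may match where `Emoji` does not; treat that as an assumption to verify on your target engines.

```javascript
// Hypothetical helper: does this engine's Unicode data assign `property`
// to the first code point of `sample`?
function engineKnows(property, sample) {
  return new RegExp("^\\p{" + property + "}", "u").test(sample);
}

// U+1FAE8 SHAKING FACE was added in Emoji 15.0.
const shakingFace = "\u{1FAE8}";

// Both results vary with the engine's Unicode version.
console.log(engineKnows("Emoji", shakingFace));
console.log(engineKnows("Extended_Pictographic", shakingFace));
```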

You can see an example working on https://unicode.org/Public/emoji/15.0/emoji-sequences.txt here: https://jsfiddle.net/r8gef2tc/

  • Please note that the example links to the SVGs on GitHub directly. Those URLs might break and should not be used in production. Download and host Noto Emoji yourself instead.
  • The example on jsfiddle also adds (${m}) behind the image so you can compare the output between native browser rendering and the resulting image.
  • The example contains Emoji V15.1: ‍‍‍. Emoji Version 15.1 is only a draft at the time of writing this. Noto Emoji does not provide an image for this family emoji yet. The regular expression doesn't know that and tries to load an image anyway. When that fails, the image gets replaced by its alt text which is the source emoji again. If your system doesn't support this new family emoji, it will show its four components instead.
1

With the newly introduced v flag, matching emojis is no longer a problem:

console.log(/\p{RGI_Emoji}/v.test('‍❤️‍‍⚕️'))           // checks if present... true
console.log([...'‍❤️‍‍⚕️'.matchAll(/\p{RGI_Emoji}/gv)]) // =>[["\u200d❤️\u200d"],["\u200d⚕️"]]

More details:

This code snippet refers to the property of strings RGI_Emoji, which Unicode defines as “the subset of all valid emoji (characters and sequences) recommended for general interchange”. With this, we can now match emoji regardless of how many code points they consist of under the hood!

The v flag enables support for the following Unicode properties of strings from the get-go:

Basic_Emoji
Emoji_Keycap_Sequence
RGI_Emoji_Modifier_Sequence
RGI_Emoji_Flag_Sequence
RGI_Emoji_Tag_Sequence
RGI_Emoji_ZWJ_Sequence
RGI_Emoji

This list of supported properties might grow in the future as the Unicode Standard defines additional properties of strings. Although all current properties of strings happen to be emoji-related, future properties of strings might serve entirely different use cases.

The team also considers porting these features back to the u flag.
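
Because the v flag is recent, a feature check may be useful. The sketch below is one possible approach, not an established recipe: `makeEmojiRegex` and `countEmoji` are hypothetical names, and the `Extended_Pictographic` fallback is an assumption that only approximates `RGI_Emoji`.

```javascript
// Returns a global emoji regex, preferring the `v` flag when the engine has it.
function makeEmojiRegex() {
  try {
    // Throws a SyntaxError on engines that do not know the `v` flag.
    return new RegExp("\\p{RGI_Emoji}", "gv");
  } catch (err) {
    // Assumed fallback: matches single pictographic code points only,
    // so multi-code-point sequences are split into their parts.
    return new RegExp("\\p{Extended_Pictographic}", "gu");
  }
}

function countEmoji(text) {
  const matches = text.match(makeEmojiRegex());
  return matches ? matches.length : 0;
}

console.log(countEmoji("a \u{1F600} b")); // prints 1
console.log(countEmoji("plain text")); // prints 0
```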

Wiktor Stribiżew