13

In my JavaScript I am trying to substring() text, which generally works but unfortunately cuts emojis in half.

usaText = "A🇺🇸Z"
splitText = usaText.substring(0,2) //"A�"
splitText = usaText.substring(0,3) //"A🇺"
splitText = usaText.substring(0,4) //"A🇺�"
splitText = usaText.substring(0,5) //"A🇺🇸"

Is there a way to use substring without breaking emoji? In my production code I cut at about 40 characters and I wouldn't mind if it was 35 or 45. I have thought about simply checking whether the 40th character is a number or between a-z but that wouldn't work if you got a text full of emojis. I could check whether the last character is one that "ends" an emoji by pattern matching but this also seems a bit weird performance-wise.

Am I missing something? With all the bloat that JavaScript carries, is there no built-in count that sees emoji as one?

Regarding the suggested duplicate, Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters"):

chrs = Array.from( usaText )
(4) ["A", "🇺", "🇸", "Z"]
0: "A"
1: "🇺"
2: "🇸"
3: "Z"
length: 4

That's one too many unfortunately.

user2875404
  • You might consider looking for emojis, log where they are, then remove them. Then do the substring, then put the emojis into the substrings based on where they were in the original string. The substrings won't be the same length anymore, but you say that isn't an issue. – RobG Sep 26 '18 at 22:26
  • 3
    Forget about "emoji", you're asking about surrogate pair UTF-16, applying to normal languages as much as they do to emoji. There is an elegant solution for this, already answered over on https://stackoverflow.com/questions/21397316/split-javascript-string-into-array-of-codepoints-taking-into-account-surrogat, consisting of using `Array.from(yourstring)`, which will split your string into individual unicode characters without breaking them between bytes. – Mike 'Pomax' Kamermans Sep 26 '18 at 22:32
  • Please check my code. I did try that already and while it made my situation a bit better it still leaves me with 2 parts. – user2875404 Sep 26 '18 at 22:41

3 Answers

12

So this isn't really an easy thing to do, and I'm inclined to tell you that you shouldn't write this on your own. You should use a library like runes.

Just a simple npm i runes, then:

const runes = require('runes');
const usaText = "A🇺🇸Z";
runes.substr(usaText, 0, 2); // "A🇺🇸"
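
Where adding a dependency is not an option, a similar grapheme-aware cut can be sketched with the built-in Intl.Segmenter. This assumes a runtime that supports it (recent Node.js and current browsers do); the helper name graphemeSubstring is hypothetical and not part of runes or any library.

```javascript
// Hypothetical helper: behaves like substring(start, end), but the indices
// count grapheme clusters (what a user perceives as one character) instead
// of UTF-16 code units.
function graphemeSubstring(text, start, end) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // segment() returns an iterable of { segment, index, ... } objects.
  const graphemes = Array.from(segmenter.segment(text), (s) => s.segment);
  return graphemes.slice(start, end).join('');
}

const usaText = 'A\u{1F1FA}\u{1F1F8}Z'; // "A🇺🇸Z"
console.log(graphemeSubstring(usaText, 0, 2)); // "A🇺🇸"
```

Building the full grapheme array is fine for short strings such as a 40-character preview; for very long inputs it may be worth stopping the iteration once enough segments have been collected.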
MichaelSolati
  • 2
    The runes code also is simply-written enough that it makes a very good introduction to the major grapheme cluster splitting problems. I highly recommend reading both the code and the test cases. https://github.com/dotcypress/runes/blob/develop/index.js – Rob Napier Sep 27 '18 at 13:14
  • 2
    `runes(usaText)` -> `(3) ["A", "🇺🇸", "Z"]`. Perfect, thanks! – user2875404 Sep 27 '18 at 13:45
3

Disclaimer: This is just extending the above comment by Mike 'Pomax' Kamermans because to me it is actually a much simpler, applicable answer (for those of us who don't like reading through all the comments):

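Array.from(str) splits your string into individual Unicode code points, so a surrogate pair is never broken in half. Note that it does not keep grapheme clusters together: an emoji made of several code points, such as a flag, is still split into its parts.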

See Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters") for details.
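
A small sketch of the difference between code units and code points, using a one-code-point emoji and the two-code-point flag from the question (the variable names smile and flag are just for illustration):

```javascript
const smile = 'A\u{1F600}Z';           // "A😀Z": the emoji is one code point (a surrogate pair)
const flag = 'A\u{1F1FA}\u{1F1F8}Z';   // "A🇺🇸Z": the flag is two code points

console.log(smile.length);             // 4 (UTF-16 code units)
console.log(Array.from(smile).length); // 3 (code points, emoji stays whole)
console.log(flag.length);              // 6 (UTF-16 code units)
console.log(Array.from(flag).length);  // 4 (code points, flag is split in two)
```

So Array.from is enough when every emoji in the text is a single code point, but not for flags and other multi-code-point sequences.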

E. Villiger
2

This code has worked for me:

splitText = Array.from(usaText).slice(0, 5).join('');
hs_dino
  • Welcome to stackoverflow. In addition to the answer you've provided, please consider providing a brief explanation of why and how this fixes the issue. – jtate Apr 24 '20 at 15:03
  • 2
    Hey, `(0, 2)` on your code results in `A🇺`. Usually one would either want the emoji included completely or not at all - instead of getting broken fractions – user2875404 Apr 25 '20 at 15:56
  • this is the correct answer. Not sure why it's not green – Tengiz Jun 23 '21 at 17:26