
Given a string that can only be a single ASCII character or an emoji character sequence, how can I tell one from the other?

The idea is to separate emojis from plain text. By the spec, if you are given a string of ASCII characters mixed with emojis, then iterating over it with for (... of) gives you the ASCII characters and the emojis as separate substrings.

const text = 'ascii and emojis mixed'
for (const char of text) {
    // ... here, a char would be either an ASCII char or an emoji sequence string
    if (seeIfAscii(char)) {
       console.log('ASCII', char);
    } else {
       console.log('Emoji', char);
    }
}

function seeIfAscii(char) {
   // what comes here? <--- QUESTION!!!
}

As to why, I need to clump ASCII chars together and keep emojis one by one separate.

Trident D'Gao
  • Does this answer your question? [Emoji value range](https://stackoverflow.com/questions/30470079/emoji-value-range) – Christian Fritz Nov 12 '22 at 21:30
  • also https://stackoverflow.com/questions/147824/how-to-find-whether-a-particular-string-has-unicode-characters-esp-double-byte test here: https://jsfiddle.net/L7be6y3g/ – GrafiCode Nov 12 '22 at 21:32
  • What is an "ANSII character"? – xehpuk Nov 12 '22 at 21:51
  • You probably mean either "ANSI" or "[ASCII](https://en.wikipedia.org/wiki/ASCII)" character. These are by definition single byte. Emojis are by definition multi-byte. Assuming your source is encoded in UTF-8, which is most common, an emoji "🍦" would be [`F09F8DA6`](http://zuga.net/articles/unicode/character/1F366/). How would you separate that emoji from 4 valid ascii points `F0 9F 8D A6`? There are some encoding issues you clearly have not defined in your question. What encoding is your data stored in? Do you read byte by byte, or do you read as UTF-8 or something else? – MyICQ Nov 12 '22 at 21:58

2 Answers


You can use String.prototype.charCodeAt() to get the UTF-16 code unit at a given index of the string:

> x = "abc😀"
'abc😀'
> x.charCodeAt(0)
97
> x.charCodeAt(3)
55357

Then the problem becomes one of defining which ranges are emojis and which are other characters. According to the "Emoji value range" question, emojis do not occupy a single range, but depending on your application's needs it might be sufficient to treat anything above 255, where the extended ASCII (Latin-1) table ends, as a character of interest.

On that note: there is a gap between ASCII characters and emojis, and that gap contains other Unicode characters. Your question doesn't specify how you'd like to treat them.
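One caveat is worth illustrating: charCodeAt() reads a single UTF-16 code unit, so for an emoji outside the Basic Multilingual Plane it returns only the high surrogate (such as 55357), while codePointAt() returns the full code point. The following is a minimal sketch, not part of the original answer; the helper name isCharacterOfInterest and the sample emoji (U+1F600) are assumptions chosen for illustration.

```javascript
// Sketch: charCodeAt() vs. codePointAt() on a string that ends in an emoji.
// '\u{1F600}' (grinning face) is an assumed sample; any astral emoji behaves similarly.
const sample = 'a\u{1F600}';

console.log(sample.charCodeAt(1));  // 55357  (0xD83D, the high surrogate only)
console.log(sample.codePointAt(1)); // 128512 (0x1F600, the full code point)

// Hypothetical helper: apply the "anything above 255" rule to one item
// produced by for...of, which iterates by code point.
function isCharacterOfInterest(char) {
  return char.codePointAt(0) > 255;
}

for (const char of sample) {
  console.log(char, isCharacterOfInterest(char));
}
```

Either call works for the simple threshold test, because a high surrogate is itself far above 255; codePointAt() only matters once you compare against specific emoji ranges.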

Andres Riofrio
Christian Fritz

You haven't specified what you want to do with characters that are not ASCII and are not emojis, such as "á", "≥", and "カ".

If you want non-ASCII characters to be treated the same as emojis (so you're detecting whether the character is ASCII or not):

function isAscii(char) {
  return char.charCodeAt(0) < 128;
}

console.log(
  isAscii('a'), // true
  isAscii('ç'), // false
  isAscii('😀'), // false
)

But what you probably want is to treat non-emoji characters the same as ASCII characters (so you're detecting whether the character is an emoji or not). To do this, you can use a Unicode property escape, as described in this answer:

function isEmoji(char) {
  return /\p{Extended_Pictographic}/u.test(char)
}

console.log(
  isEmoji('a'), // false
  isEmoji('ç'), // false
  isEmoji('😀'), // true
)
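
As a hedged illustration of why the property choice matters: the broader Emoji property also matches plain digits, whereas Extended_Pictographic does not. The variable names below are made up for this sketch, and the sample characters are assumptions.

```javascript
// Sketch: compare two Unicode property escapes on a digit and an emoji.
const byEmoji = /\p{Emoji}/u;
const byPictographic = /\p{Extended_Pictographic}/u;

const digit = '7';
const face = '\u{1F600}'; // grinning face, an assumed sample

console.log(byEmoji.test(digit));        // true  (digits carry the Emoji property)
console.log(byPictographic.test(digit)); // false
console.log(byEmoji.test(face));         // true
console.log(byPictographic.test(face));  // true
```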

If, as your question states, you are sure that the character can only be one of the two, then the two approaches are equivalent.

But keep in mind the following cases:

  • Are you sure that none of your users will ever write a word with non-English letters, like "fiancée"? é is not ASCII, though you could get by with the Latin-1 Supplement: char.charCodeAt(0) < 256.
  • Are you sure none of your users use macOS or iOS? Those platforms automatically convert quotation marks into "smart" quotes “” ‘’ as the user types. These aren't in ASCII or the Latin-1 Supplement.
  • And many more cases!
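
The cases in the list above can be checked quickly. This is a small sketch, with rangeFlags as a hypothetical helper name, that shows where an accented letter and the "smart" quotes fall relative to the ASCII and Latin-1 thresholds.

```javascript
// Sketch: report the UTF-16 code unit of a character and which threshold it passes.
function rangeFlags(char) {
  const code = char.charCodeAt(0);
  return { code, ascii: code < 128, latin1: code < 256 };
}

const samples = [
  ['e', 'plain ASCII letter'],
  ['\u00E9', 'e with acute accent'],
  ['\u201C', 'left double "smart" quote'],
  ['\u2019', 'right single "smart" quote'],
];

for (const [char, label] of samples) {
  console.log(label, rangeFlags(char));
}
```

The accented letter passes only the Latin-1 test, and the smart quotes pass neither, so a threshold-based check would lump them in with emojis.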

So it's best to assume there will be characters that are neither ASCII nor emojis, and to choose accordingly. You probably want to treat ASCII and other non-emoji characters the same and emojis differently, but I don't know your use case well enough to tell for sure.
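
For the stated goal of clumping text together while keeping emojis separate, one possible approach is sketched below. It assumes a runtime that provides Intl.Segmenter, and it swaps for...of for grapheme segmentation, because for...of iterates by code point and would split a multi-code-point emoji sequence into pieces. The helper name clumpText and the output shape are assumptions, not something from the question or the answers.

```javascript
// Sketch: split text into runs of non-emoji text and individual emojis.
// Assumes Intl.Segmenter and Unicode property escapes are available.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const emojiPattern = /\p{Extended_Pictographic}/u;

// Hypothetical helper: returns an array of { type, value } parts.
function clumpText(text) {
  const parts = [];
  let run = '';
  for (const { segment } of segmenter.segment(text)) {
    if (emojiPattern.test(segment)) {
      // Flush any pending text run, then emit the emoji on its own.
      if (run) {
        parts.push({ type: 'text', value: run });
        run = '';
      }
      parts.push({ type: 'emoji', value: segment });
    } else {
      run += segment;
    }
  }
  if (run) parts.push({ type: 'text', value: run });
  return parts;
}

// Woman technologist: woman + zero-width joiner + laptop (three code points).
const womanTechnologist = '\u{1F469}\u200D\u{1F4BB}';
console.log([...womanTechnologist].length); // 3, so for...of would split it
console.log(clumpText(`hi ${womanTechnologist} caf\u00E9`));
```

Some symbols that usually render as text, such as ©, also carry the Extended_Pictographic property, so the pattern may need adjusting for a particular use case.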

Andres Riofrio