I have a String named fancy containing "𝖑𝖒𝖆𝖔", and I need to turn it into "lmao".

I've tried calling String#trim, but with no success.

Example code:

var fancy = "𝖑𝖒𝖆𝖔";
var normal = //Magic to convert 𝖑𝖒𝖆𝖔 to lmao

EDIT: I figured out that if I take the Unicode code point of a fancy character and subtract 120101 from it, I get the original character. However, there are more types of these fancy texts, so this does not seem like a solution to my problem.

kyngs
  • Take a look at the UTF-8 charset table, or whichever charset table contains these letters. Once done, you can map each char code from your fancy alphabet to an ASCII char code and convert it. – midugh Jan 22 '21 at 21:16
  • You can use decomposition: `String normal = java.text.Normalizer.normalize(fancy, java.text.Normalizer.Form.NFKD);`. See [this question](https://stackoverflow.com/questions/3322152/is-there-a-way-to-get-rid-of-accents-and-convert-a-whole-string-to-regular-lette) for background. In your case you do need `Normalizer.Form.NFKD`. – andrewJames Jan 22 '21 at 21:41
  • @andrewjames This is unfortunately an unsuitable solution because I need to be able to write diacritical marks like čšúů etc. For context: I write a discord bot, and I want to remove messages with this fancy thingy and replace them with normal ones, however, a lot of people will write messages with čšúů etc. and this would completely break them. – kyngs Jan 22 '21 at 21:44
  • You should not lose diacritics this way. Sorry if the linked question is misleading here (which is why it's not a duplicate). Removal of diacritics is an extra step following the decomposition step I mentioned. I think `NFKC` should also work in your context. – andrewJames Jan 22 '21 at 21:54

2 Answers

You can take advantage of the fact that your "𝖆" character decomposes to a regular "a":

Decomposition LATIN SMALL LETTER A (U+0061)

Java's java.text.Normalizer class supports several normalization forms. The NFKD and NFKC forms use the above decomposition rule.

String normal = Normalizer.normalize(fancy, Normalizer.Form.NFKC);

Using compatibility equivalence is what you need here:

Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors.

(You do not lose diacritics because decomposition simply separates the diacritic marks from their base letters, and a composing form such as NFKC then re-combines them.)
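
As a rough illustration of the point above, the following sketch runs NFKC over the fancy string and over a string with diacritics. The class name `NormalizeDemo` and the helper `toPlain` are hypothetical names chosen for this example, and the strings are written as `\u` escapes only so that the source file does not depend on its encoding.

```java
import java.text.Normalizer;

public class NormalizeDemo {
    // Hypothetical helper: applies compatibility composition (NFKC).
    static String toPlain(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+1D591, U+1D592, U+1D586, U+1D594 written as surrogate pairs
        String fancy = "\uD835\uDD91\uD835\uDD92\uD835\uDD86\uD835\uDD94";
        // U+010D, U+0161, U+00FA, U+016F: precomposed letters with diacritics
        String accented = "\u010D\u0161\u00FA\u016F";

        System.out.println(toPlain(fancy));                     // lmao
        System.out.println(toPlain(accented).equals(accented)); // true
    }
}
```

Note that NFKC also folds other compatibility characters (ligatures, full-width forms, superscripts), which may or may not be desirable for chat messages.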

andrewJames

Those are Unicode characters. https://unicode-table.com also provides a reverse lookup to identify them (copy-paste them into the search).

The fancy characters identify as:

  • Mathematical Bold Fraktur Small L (U+1D591)
  • Mathematical Bold Fraktur Small M (U+1D592)
  • Mathematical Bold Fraktur Small A (U+1D586)
  • Mathematical Bold Fraktur Small O (U+1D594)

You can also find them as 'old style english alphabet' on this list: https://unicode-table.com/en/sets/fancy-letters. There we notice that they are laid out in the same order as the regular alphabetic characters, so the characters have a fixed offset:

int offset = 0x1D586 - 'a'; // 𝖆 is U+1D586

You can thus transform the characters back by subtracting that offset.

Now comes the tricky part: these Unicode code points cannot be represented by a single char, which is only 16 bits wide and therefore cannot hold every Unicode character on its own (in Java's UTF-16 strings, one or two chars are needed, depending on the character). The proper way to deal with this is to work with the code points directly:

String fancy = "𝖑𝖒𝖆𝖔";

int offset = 0x1D586 - 'a'; // 𝖆 is U+1D586

String plain = fancy.codePoints()
    .map(i -> i - offset)       // shift each code point back into a-z
    .mapToObj(c -> (char) c)    // safe here: the shifted values are ASCII
    .map(String::valueOf)
    .collect(java.util.stream.Collectors.joining());

System.out.println(plain);

This then prints lmao.
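
The fixed-offset approach shifts every code point, so ordinary text mixed into the same message would be mangled. A minimal sketch of a guarded variant follows, assuming only the bold Fraktur small letters (U+1D586 for a through U+1D59F for z) need converting; the class name `FrakturDemo` and the method `unfraktur` are hypothetical names for this example. It also prints the string's length in chars and in code points to show the surrogate pairs mentioned above.

```java
public class FrakturDemo {
    // Assumed range: Mathematical Bold Fraktur small a..z
    static final int FRAKTUR_A = 0x1D586;
    static final int FRAKTUR_Z = FRAKTUR_A + 25; // U+1D59F

    // Hypothetical helper: shifts only code points inside the range above.
    static String unfraktur(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (cp >= FRAKTUR_A && cp <= FRAKTUR_Z) {
                out.appendCodePoint(cp - FRAKTUR_A + 'a');
            } else {
                out.appendCodePoint(cp); // leave everything else untouched
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String fancy = new StringBuilder()
            .appendCodePoint(0x1D591)  // l
            .appendCodePoint(0x1D592)  // m
            .appendCodePoint(0x1D586)  // a
            .appendCodePoint(0x1D594)  // o
            .toString();

        System.out.println(fancy.length());                            // 8
        System.out.println(fancy.codePointCount(0, fancy.length()));   // 4
        System.out.println(unfraktur(fancy + " ok"));                  // lmao ok
    }
}
```

Other fancy alphabets would each need their own range and offset, which is why normalization may be the more general route.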

Peter Walser