15

I want to get a kanji's Unicode value. It might look something like let values: &[u16] = f("ののの");

When I use "の".as_bytes() I get [227, 129, 174].

When I use 'の'.escape_unicode() I get '\u{306e}'; the 0x306e is exactly what I want.

AurevoirXavier
  • 1
    `'の' as u16`, hex encode. If you want to operate on an entire string and you’re confident that it’s all kanji, you *can* encode it as UTF-16. – Ry- Oct 21 '18 at 20:28
  • 1
    ...though of course if one is looking for code points then `as u32` would be highly recommended. True, UTF-16 is good enough for kanji today, but in general that encoding is just a mess. Many characters will fail to give the correct code point with `u16`. – Ray Toal Oct 21 '18 at 20:35
  • 1
    `"の".chars().map(|ch| ch as u32).collect::<Vec<_>>()`, though using `.chars()` directly should be sufficient in most cases. Note that a `char` may need more than 16 bits. – starblue Oct 22 '18 at 12:34

1 Answer

23

The char type can be cast to u32 using as. The line

println!("{:x}", 'の' as u32);

will print "306e" (using {:x} to format the number as hex).
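
For the whole-string case from the question, the same cast can be mapped over `.chars()`. This is only a minimal sketch; the helper name `code_points` is made up for illustration and is not part of the standard library.

```rust
/// Collects the Unicode code point of every char in `s`.
/// (`code_points` is a hypothetical helper name, not a std function.)
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|ch| ch as u32).collect()
}

fn main() {
    let values = code_points("ののの");
    for v in &values {
        // {:x} formats each code point as lowercase hex
        println!("{:x}", v);
    }
}
```

It returns a `Vec<u32>` rather than the `&[u16]` sketched in the question, since a code point does not always fit in 16 bits.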

If you are sure all your characters are in the Basic Multilingual Plane (BMP), you can in theory also cast directly to u16. For characters from supplementary planes this will silently give wrong results, though, e.g. '🝖' as u16 returns 0xf756 instead of the correct 0x1f756, so you need a strong reason to do this.
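
If 16-bit values really are needed, one way to avoid the silent truncation is a checked conversion, or encoding to UTF-16 and accepting surrogate pairs. The following is a sketch under those assumptions; `bmp_code_unit` is a hypothetical helper name.

```rust
use std::convert::TryFrom;

/// Returns the code point as a u16 only if it fits in 16 bits,
/// i.e. the character is in the BMP. (Hypothetical helper name.)
fn bmp_code_unit(ch: char) -> Option<u16> {
    u16::try_from(ch as u32).ok()
}

fn main() {
    let c = '\u{1f756}'; // a character outside the BMP

    // The plain cast truncates to the low 16 bits.
    println!("{:x}", c as u16); // f756
    println!("{:x}", c as u32); // 1f756

    // The checked conversion reports the problem instead.
    println!("{:?}", bmp_code_unit('の')); // Some(12398), i.e. 0x306e
    println!("{:?}", bmp_code_unit(c)); // None

    // UTF-16 encoding represents the character as a surrogate pair.
    let units: Vec<u16> = "\u{1f756}".encode_utf16().collect();
    println!("{:x?}", units); // [d83d, df56]
}
```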

Internally, a char is stored as a 32-bit number, so c as u32 for some character c only reinterprets the memory representation of the character as a u32.
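
A small check of that claim, plus the reverse direction, which has to be validated because not every u32 is a valid char:

```rust
use std::mem::size_of;

fn main() {
    // A char occupies four bytes, the same as a u32.
    assert_eq!(size_of::<char>(), 4);
    assert_eq!(size_of::<char>(), size_of::<u32>());

    // char -> u32 always succeeds.
    let n = 'の' as u32;
    assert_eq!(n, 0x306e);

    // u32 -> char is checked: surrogates and out-of-range values are rejected.
    assert_eq!(char::from_u32(n), Some('の'));
    assert_eq!(char::from_u32(0xD800), None);
    assert_eq!(char::from_u32(0x110000), None);
}
```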

loganfsmyth
Sven Marnach
  • 5
    I'd go as far as to say "don't EVER use `u16` at all!" It's just misleading and an unnecessary "optimization." But kudos for working out and showing that the `as u16` silently drops off the higher-order 16 bits of the code point. That's good information to have and nicely researched. I'd suggest phrasing it more as "Don't do this" because you might know your characters are all in the BMP today, but tomorrow they might not be. – Ray Toal Oct 21 '18 at 20:38
  • Thank you. By the way, do you know how to get its Shift JIS value? Should I use a lookup table? – AurevoirXavier Oct 21 '18 at 20:40
  • @RayToal I agree and changed the wording slightly. – Sven Marnach Oct 21 '18 at 20:42
  • 1
    @AurevoirXavier I just googled that for you – here you go: https://stackoverflow.com/questions/48136939/how-do-i-use-the-shift-jis-encoding-in-rust – Sven Marnach Oct 21 '18 at 20:43