I want to create a Caesar cipher that can encode/decode Unicode printable characters (single- and multi-codepoint grapheme clusters, emojis, etc.) from the whole of Unicode (except the private use area). Preferably, it will use a list of all printable characters.
NOTE: Even though I want to create a Caesar cipher, it is really not about encryption. The question is about investigating the properties of Unicode.
I found these questions:
What is the range of Unicode Printable Characters?
Cipher with all unicode characters
But I didn't get an answer to what I want.
Note: If you give a coding answer, I am mostly interested in a solution that uses either python3 or perl6, as they are my main languages.
Recently, I was given an assignment to write a Caesar cipher and then encode and decode an English text.
I solved it in Python by using the string library's built-in string.printable constant. Here is a printout of the constant (I used Visual Studio Code):
[see python code and results below]
The documentation says: ''' String of ASCII characters which are considered printable. This is a combination of digits, ascii_letters, punctuation, and whitespace. ''' https://docs.python.org/3.6/library/string.html#string-constants
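For context, the assignment solution was roughly along these lines. This is a minimal sketch rather than my exact code; the function name `caesar` and the choice to pass unknown characters through unchanged are just assumptions made for illustration.

```python
import string

# The 100 printable ASCII characters serve as the cipher alphabet.
ALPHABET = string.printable


def caesar(text: str, shift: int) -> str:
    """Shift every character found in ALPHABET; leave anything else untouched."""
    out = []
    for ch in text:
        idx = ALPHABET.find(ch)
        if idx == -1:
            out.append(ch)  # not in the alphabet: pass through unchanged
        else:
            out.append(ALPHABET[(idx + shift) % len(ALPHABET)])
    return "".join(out)


encoded = caesar("Hello, World!", 3)
decoded = caesar(encoded, -3)  # a negative shift undoes the encoding
print(encoded)
print(decoded)
```

Because every character maps to exactly one other character of the same alphabet, decoding with the opposite shift gives back the original text.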
I am wondering how you could create a Caesar cipher that could encode/decode all the possible printable characters you can make from Unicode codepoints (just assume you have all the necessary fonts to see those that should be visible on screen).
Here is my understanding of what it means for something to be a printable character:
When I take the Python string constant above and traverse it with the left or right arrow keys on the keyboard, it takes me exactly 100 strokes to get to the end (the same as the number of characters). It looks like there is a one-to-one correspondence between being a printable character and being traversable with one stroke of an arrow key.
Now consider this string:
"ij क्षि "
Based on Python's string.printable constant, this string seems to me to be composed of the following 7 printable characters (you can look up individual codepoints at https://unicode-table.com/en/):
1 (family) 2 (Latin Small Ligature Ij) 3 (carriage return) 4 (Devanagari kshi) 5 (space) 6 (Zero Width No-Break Space) 7 (Ace of spades)
(Family: man, woman, girl, boy) codepoints: 128104 8205 128105 8205 128103 8205 128102 (reference: https://emojipedia.org/family-man-woman-girl-boy/)
(Latin Small Ligature Ij) ij codepoint: 307
(Carriage Return) codepoint: 13
(Devanagari kshi)
क्षि
codepoints: 2325 2381 2359 2367
(see this page: http://unicode.org/reports/tr29/)
(the codepoints on that page seem to be given in hexadecimal rather than decimal)
(Space) codepoint: 32
(Zero Width No-Break Space) codepoint: 65279 (AKA U+FEFF BYTE ORDER MARK (BOM)) (https://en.wikipedia.org/wiki/Byte_order_mark)
(Playing Card Ace of Spades) codepoint: 127137
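The decimal codepoints listed above can be checked from Python with the standard unicodedata module. This is only a small sketch for inspection; the "<no name>" fallback is my own placeholder for codepoints that have no character name (such as the carriage return).

```python
import unicodedata

# The decimal codepoints from the list above, in order.
codepoints = [128104, 8205, 128105, 8205, 128103, 8205, 128102,
              307, 13, 2325, 2381, 2359, 2367, 32, 65279, 127137]
text = "".join(chr(cp) for cp in codepoints)

for cp in codepoints:
    ch = chr(cp)
    name = unicodedata.name(ch, "<no name>")  # control characters have no name
    print(f"{cp:>6}  U+{cp:04X}  {unicodedata.category(ch)}  {name}")

# len() counts codepoints, not what I call printable characters.
print(len(text))
```

Python's len() reports 16 here, one per codepoint, which is neither the 7 characters I count nor the 10 keystrokes Notepad needs.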
When I paste this string into Notepad and try to traverse it with an arrow key, I end up using 10 keystrokes rather than 7, because the family emoji needs 4 keystrokes (probably because Notepad can't deal with the Zero Width Joiner, codepoint 8205, and of course Notepad can't display a family glyph). On the other hand, when I paste the string into Google search, I can traverse the whole string with 7 strokes.
Then I tried creating the string in Perl6 to see what Perl6's grapheme awareness would make of the string:
(I use the Atom editor)
[see perl6 code and results below]
Perl6 thinks that the Devanagari kshi character क्षि (4 codepoints) is actually 2 graphemes, each with 2 codepoints. Even though it CAN be represented as two characters, as seen in the above list, I think this is a bug. Perl6 is supposed to be grapheme aware, and even my Windows Notepad (and Google search) treats it as a single grapheme/character.
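To see where a count of 8 might come from, here is a rough Python sketch that groups codepoints into clusters. It is a deliberately simplified approximation and NOT the full segmentation algorithm from the tr29 report: it only assumes that combining marks (categories Mn, Mc, Me) and Zero Width Joiner sequences stick to the preceding character. The function name `rough_clusters` is made up for this example.

```python
import unicodedata

ZWJ = "\u200d"  # Zero Width Joiner, codepoint 8205


def rough_clusters(text: str) -> list:
    """Very rough grouping: marks and ZWJ-joined codepoints extend the previous cluster."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        follows_zwj = bool(clusters) and clusters[-1].endswith(ZWJ)
        if clusters and (is_mark or ch == ZWJ or follows_zwj):
            clusters[-1] += ch  # attach to the cluster in progress
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


codepoints = [128104, 8205, 128105, 8205, 128103, 8205, 128102,
              307, 13, 2325, 2381, 2359, 2367, 32, 65279, 127137]
text = "".join(chr(cp) for cp in codepoints)

result = rough_clusters(text)
print(len(result))
print([[ord(c) for c in cluster] for cluster in result])
```

Under these simplified rules the family emoji stays together, but the kshi character splits after the virama (2381) because the following letter (2359) is not a combining mark, which is the same split Perl6 shows.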
Based on the 2 strings, the practical definition of a printable character seems to be this: 'It is any combination of Unicode codepoints that can be traversed by one push of a left or right arrow key on the keyboard under ideal circumstances'.
"Under ideal circumstances" means that you are using an environment that, so to speak, acts like Google search: that is, it recognizes for example an emoji (the 4-person family) or a grapheme cluster (the Devanagari character) as one printable character.
3 questions:
1: Is the above a fair definition of what it means to be a printable character in unicode?
2: Regardless of whether you accept the definition, do you know of any list of printable characters that covers the currently used Unicode planes and possible grapheme clusters, rather than just the 100 ASCII characters the Python string library has? (If I had such a list, I imagine I could create a cipher quite easily.)
3: Given that such a list does not exist, and you accept the definition, how would you go about creating such a list with which I could create a caesar cipher that could cipher any/all printable characters given the following 4 conditions?
NOTE: these 4 conditions are just what I imagine is required for a proper caesar cipher.
condition a
The string to be encrypted will be a valid utf8 string consisting of standard unicode code points (no unassigned, or private use area codepoints)
condition b
The encrypted string must also be a valid utf8 string consisting of standard unicode code points.
condition c
You must be able to traverse the encrypted string using the same number of strokes with the left or right arrow keys on the keyboard as the original string (given ideal circumstances as described above). This means that both the man-woman-girl-boy family emoji and the Devanagari character, when encoded, must each correspond to exactly one other printable character and not a set of "nonsense" codepoints that the arrow keys will interpret as different characters. It also means that a single-codepoint character can potentially be converted into a multi-codepoint character and vice versa.
condition d
As in any encrypt/decrypt algorithm, the string to be encrypted and the string that has been decrypted (the end result) must contain the exact same codepoints (the 2 strings must be equal).
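As a starting point, here is a sketch of how far the single-codepoint case might get with Python's unicodedata module. It rests on assumptions of mine: every codepoint whose general category starts with "C" (control, format, surrogate, private use, unassigned) or "M" (combining marks) is left out of the alphabet and passed through unchanged. The names `build_alphabet` and `caesar` are only illustrative. It aims at conditions a, b and d, but it does NOT satisfy condition c for multi-codepoint clusters, because the parts of a cluster are shifted individually.

```python
import sys
import unicodedata


def build_alphabet() -> list:
    """Collect single codepoints that are assigned and are neither control/format nor marks."""
    alphabet = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        major = unicodedata.category(ch)[0]
        if major in ("C", "M"):  # assumption: skip Cc, Cf, Cs, Co, Cn and all marks
            continue
        alphabet.append(ch)
    return alphabet


ALPHABET = build_alphabet()
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}  # fast lookup of positions


def caesar(text: str, shift: int) -> str:
    """Shift characters within ALPHABET; anything outside it passes through unchanged."""
    n = len(ALPHABET)
    return "".join(
        ALPHABET[(INDEX[ch] + shift) % n] if ch in INDEX else ch
        for ch in text
    )


original = "Hello \u0133 \U0001F0A1"
encoded = caesar(original, 3)
decoded = caesar(encoded, -3)
print(len(ALPHABET))
print(decoded == original)
```

The size of the alphabet depends on the Unicode version bundled with the Python interpreter, so the same shift may give different results on different Python versions.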
# Python 3.6:
import string
# built-in library
print(string.printable)
print(type(string.printable))
print(len(string.printable))
# length of the string (number of ASCII characters)
#perl6
use v6;
my @ordinals = <128104 8205 128105 8205 128103 8205 128102>;
#array of the family codepoints
@ordinals.append(<307 13 2325 2381 2359 2367 32 65279 127137>);
#add the other codepoints
my $result_string = '';
for @ordinals {
$result_string = $result_string ~ $_.chr;
}
# get a string of characters from the ordinal numbers
say @ordinals; # the list of codepoints
say $result_string; # the string
say $result_string.chars; # the number of characters.
say $result_string.comb.perl; # a list of characters in the string
python results:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
<class 'str'>
100
perl6 results:
[128104 8205 128105 8205 128103 8205 128102 307 13 2325 2381 2359 2367 32 65279 127137]
ij क्षि
8
("", "ij", "\r", "क्", "षि", " ", "", "").Seq