1

I want to create a Caesar cipher that can encode/decode unicode printable characters (single- and multi codepoint grapheme clusters, emojis ect.) from the whole of Unicode (except the private use area). Preferably, it will use a list of all printable characters.

NOTE: Even though I want to create a caesar cipher, it is really not about encryption. The question is about investigating the properties of unicode.

I found these questions:

What is the range of Unicode Printable Characters?

Cipher with all unicode characters

But I didn't get an answer to what I want.

Note: If you give a coding answer, I am mostly interested in a solution that uses either python3 or perl6, as they are my main languages.

Recently, I was given an assignment to write a Caesar cipher and then encode and decode an English text.

I solved it in python by using the string library's built-in string.printable constant. Here is a printout of the constant: (I used visual studio code)

[see python code and results below]

The documentation says: ''' String of ASCII characters which are considered printable. This is a combination of digits, ascii_letters, punctuation, and whitespace. ''' https://docs.python.org/3.6/library/string.html#string-constants

I am wondering how you could create a caesar cipher that could encode/decode all the possible printable characters you can make from unicode codepoints (just asume you have all necessary fonts to see those that should be visible on screen).

Here is my understanding of what it means for something to be a printable character:

When I take the python string constant above, and traverse it with the left or rigt arrow keys on the keyboard, It takes me exactly 100 strokes to get to the end (the same as the number of characters). It looks like there is a one-to-one correspondence between being a printable character and being traversible with one stroke of an arrow key.

Now consider this string:

"‍‍‍ij क्षि "

Based on pythons string.printable constant, This string seems to me to be composed of the following 7 printable characters: (you can look up individual codepoints at: https://unicode-table.com/en/)

1 (family) 2 (Latin Small Ligature Ij) 3 (cariage return) 4 (Devanagari kshi) 5 (space) 6 (Zero Width No-Break Space) 7 (Ace of spades)

‍‍‍ codepoints: 128104 8205 128105 8205 128103 8205 128102 (reference: https://emojipedia.org/family-man-woman-girl-boy/)

(Latin Small Ligature Ij) ij codepoint: 307

(Carriage Return) codepoint: 13

(Devanagari kshi) क्षि codepoints: 2325 2381 2359 2367
(see this page: http://unicode.org/reports/tr29/) (the codepoints seems to be in hexadecimal rather than numerals)

(Space) codepoint: 32

(Zero Width No-Break Space) codepoint: 65279 (AKA U+FEFF BYTE ORDER MARK (BOM)) (https://en.wikipedia.org/wiki/Byte_order_mark)

(Playing Card Ace of Spades) codepoint: 127137

When I paste this string into notepad, and try to traverse it with an arrow key, I end up using 10 key strokes rather than 7, because the family emoji need 4 key strokes (probably because notepad cant deal with the Zero Width Joiner, codepoint: 8205, and of course notepad cant display a family glyph). On the other hand when I post the string into google search, i can traverse the whole string with 7 strokes.

Then I tried creating the string in Perl6 to see what Perl6's grapheme awareness would make of the string:

(I use the Atom editor)

[see perl6 code and results below]

perl6 thinks that the Devanagari kshi character क्षि (4 codepoints) is actually 2 graphemes, each with 2 codepoints. Even though it CAN be represented as two characters, as seen in the above list, I think this is a bug. Perl6 is supposed to be grapheme aware, and even my windows notepad (and google search) thinks it is a single grapheme/character.

Based on the 2 strings, The practical definition of a printable character seems to be this: 'It is any combination of unicode codepoints that can get traversed by one push of a left or right arrow key on the keyboard under ideal cirkumstances'.

"under ideal cirkumstances" means that you are using an environment that, so to speak, act like google search: That is, it recognizes for example an emoji (the 4 person family) or a grapheme cluster (the devanagari character) as one printable character.

3 questions:

1: Is the above a fair definition of what it means to be a printable character in unicode?

2: Regardless of whether you accept the definition, do you know of any list of printable characters that cover the currently used unicode planes and possible grapheme clusters, rather than just the 100 ASCII characters the python string library has (If I had such a list I imagine I could create a cipher quite easily)?

3: Given that such a list does not exist, and you accept the definition, how would you go about creating such a list with which I could create a caesar cipher that could cipher any/all printable characters given the following 4 conditions?

NOTE: these 4 conditions are just what I imagine is required for a proper caesar cipher.

condition a

The string to be encrypted will be a valid utf8 string consisting of standard unicode code points (no unassigned, or private use area codepoints)

condition b

The encrypted string must also be a valid utf8 string consisting of standard unicode code points.

condition c

You must be able to traverse the encrypted string using the same number of strokes with the left or right arrow keys on the keyboard as the original string (given ideal circumstances as described above). This means that both the man-woman-boy-girl family emoji and the devanagari character, when encoded, must each correspond to exactly one other printable character and not a set of "nonsence" codepoints that the arrow keys will interpret as different characters. It also means that a single codepoint character can potentially be converted into a multi-codepoint character and vice versa.

condition d

As in any encrypt/decrypt algoritm, the string to be encrypted and the string that has been decrypted (the end result) must contain the exact same codepoints (the 2 strings must be equal).

# Python 3.6:
import string           
    # build-in library
print(string.printable)      
print(type(string.printable))   
print(len(string.printable))    
    # length of the string (number of ASCII characters)


#perl6
use v6;

my @ordinals = <128104 8205 128105 8205 128103 8205 128102>;    
    #array of the family codepoints
@ordinals.append(<307 13 2325 2381 2359 2367 32 65279 127137>); 
    #add the other codepoints
my $result_string = '';
for @ordinals {
    $result_string = $result_string ~ $_.chr;  
}
    # get a string of characters from the ordinal numbers
say @ordinals;              # the list of codepoints
say $result_string;             # the string
say $result_string.chars;       # the number of characters.
say $result_string.comb.perl;   # a list of characters in the string

python results:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[]^_`{|}~

class 'str'

100

perl6 results:

[128104 8205 128105 8205 128103 8205 128102 307 13 2325 2381 2359 2367 32 65279 127137]

‍‍‍ij क्षि

8

("‍‍‍", "ij", "\r", "क्", "षि", " ", "", "").Seq

Arimaafan
  • 167
  • 2
  • 7
  • 1
    hmmm i just posted this question 10 minutes ago, and I already have -2. What is wrong with my question? – Arimaafan Jan 19 '19 at 16:36
  • Note that there's an infinite number of Unicode grapheme clusters because clusters can contain an arbitrary number of combining marks. So your cipher should probably be based on single codepoints. – nwellnhof Jan 19 '19 at 16:48
  • There is a vast number of possible combinations of unicode codepoints, but only a comparatively tiny number of them will be grapheme clusters. For example if you put an 'A' (codepoint 65) and a 'B'(codepoint 66) next to each other, they wont form a grapheme cluster.so I think it is perfectly reasonably to search for a complete list of printable characters for a given version of unicode (11.0 I think is the most recent one). – Arimaafan Jan 19 '19 at 17:43
  • @mwellnhof I get what you are saying, but I think that there are other things to consider. I will make an edit to my question and explain what I think. (my old comment whent stale and I cant edit it) – Arimaafan Jan 19 '19 at 18:00
  • It is complex. Some language go to right, some to left, when encoded, the direction could change, so making difficult to reconstruct the final string [when printed, as an array no problem]. I don't agree with @nwellnhof: there is a maximum number of combining characters (for a well defined string), but in any case, on many letters you can add accent and other modifiers in different order. (and IIRC unicode restrict to 16, so a power of 16 of the number of such modifiers). Not all characters can be distinguished (in an annex). And Caesar-cipher is very week: so do not over-engineer the problem. – Giacomo Catenazzi Jan 19 '19 at 18:35
  • 1
    My recommendation: encode the unicode string, and then encode the encoding with BASE64. Then you have 64 symbols and use caesar on them. In any case: THIS IS A WEAK ENCRYPTION. DOES NOT USE IN REAL CASES. – Giacomo Catenazzi Jan 19 '19 at 18:41
  • 2
    The complexity of text motivates modern ciphers to not deal with characters. They deal with bytes for both input and output. Where the input is text, the receiver has to to be told the character encoding. Where further handling of the output must be as text, Base64 or similar is used. Expecting the output of a cipher to look like text is not generally a goal. – Tom Blodget Jan 19 '19 at 18:43
  • 1
    @GiacomoCatenazzi In general, there's no limit on the number of combining characters. The [stream-safe text format](http://www.unicode.org/reports/tr15/#Stream_Safe_Text_Format) specifies a maximum of 30 non-starters but this format isn't mandatory. – nwellnhof Jan 19 '19 at 18:50
  • @GiacomoCatenazzi It is not about encrypting a message. It is about playing around with unicode and investigating its properties. What do you mean by "Not all characters can be distinguished (in an annex)"? In some scripts, like arabic or devanagari, text can seem to "merge" into one long unit, but even those language have printable characters. The encrypted string wont look pretty, but that is ok as long as it is printable. Also, I dont see how it makes a difference that some languages go right to left. – Arimaafan Jan 19 '19 at 20:03
  • @GiacomoCatenazzi If the cipher is based on printable characters, It should work even if part of the text is in english, and another in arabic. – Arimaafan Jan 19 '19 at 20:14
  • @nwellnhof AFAIK you are right about the infinity, but that is not how unicode is used in practice or intended to be used. क्षि (2325 2381 2359 2367) is not a radom set of codepoints that some font vendor has decided should be visible as a single character. Afaik alle the blocks in unicode have been intended to be used to write a finite set of printable characters, even though some of them, like diacritics, have the potential to be used with as yet undefined future characters. – Arimaafan Jan 19 '19 at 20:23
  • @GiacomoCatenazzi You are right that the same grapheme can sometimes be created by different combinations of codepoints. á can be either (225) or (97 769) Such characters are comonly considered to be identical for the purposes of writing, and Perl6 will even "normalize" them by turning both into (225). This means that the cipher would have to be able to recognize both forms. I dont think that is too troublesome. All textparsers or fonts have to do the same. – Arimaafan Jan 19 '19 at 20:36
  • @Arimaafan: there are various normalization (and defined in unicode). You may look also the "compatibility" normalization, to see some characters that are identical to others (but not all compatibility is for that reason). Beta/SS, mu (greek letter and unit) are some examples. The letter A could be Latin or Greek (same upper case, like B, X, K, etc.). For the character above: it depends on language if some character could be merged (ligature). One could override with Zero-width non-joiner (and Zero-width joiner). – Giacomo Catenazzi Jan 20 '19 at 12:04
  • If you want to play: few years ago there were a page who displayed all unicode codepoint (and this show some browser bugs/crashes). Python has the module unicodedata (to get properties of unicode characters) [but you find the original database in http://www.unicode.org/] – Giacomo Catenazzi Jan 20 '19 at 12:07

1 Answers1

9

TL;DR I think your question is reasonable and deserves a better answer than the one I've written so far. Let's talk.


I don't think anyone can create a Caesar cipher with the requirements you list for multiple reasons.

But if your goal is really to "investigate properties of Unicode" rather than create a cipher then presumably that doesn't matter.

And of course I might just be experiencing a failure of imagination, or just a failure to understand Unicode despite spending years grappling with it.

If you critique technical aspects of my explanation below via comments I'll try to improve it and hopefully we'll both learn as we go. TIA.

"Caesar cipher with all Unicode printable characters"

This is the clean formulation you have in your title.

The only problematic parts are "Caesar", "cipher", "all", "Unicode", "printable", and "characters". Let's go thru them.

Caesar cipher

A Caesar cipher is an especially simple single alphabet cipher. Unicode isn't one big large single alphabet. But perhaps you could treat a subset of its codepoints as if they were.

I'd say that's what the SO Cipher with all unicode characters was all about.

You've currently rejected that and introduced a bunch of extra aspects that are either impossible or so difficult that they might as well be.

Ignoring your priority of investigating Unicode properties it would make sense if you instead settled for a regular ASCII cipher. Or perhaps go back to that Cipher with all unicode characters SO and pick up where they left off, perhaps noting that, according to a comment on that SO they apparently stopped at just the BMP plane:

Note that you’re only using BMP code points (i.e. from U+0000 to U+FFFF). Unicode ranges from U+0000 to U+10FFFF, so you’re missing about a million code points :)

So perhaps you could do better. I don't think it would be worthwhile from the perspective of creating a cipher for its own sake but it might be for learning more about the properties of Unicode.

Cipher

@TomBlodget in their comment on your question notes that:

The complexity of text motivates modern ciphers to not deal with characters. They deal with bytes for both input and output. Where the input is text, the receiver has to to be told the character encoding. Where further handling of the output must be as text, Base64 or similar is used. Expecting the output of a cipher to look like text is not generally a goal.

If you want a universal solution for a Unicode cipher, follow Tom's recipe.

All

In a comment on your question about the number of graphemes @nwellnhof noted that:

there's an infinite number

But you then also quite reasonably replied that there's only going to be a finite number in any given text; that Unicode's intent is that Unicode compliant software may/will generate mojibake results if given degenerate input (where what counts as degenerate is somewhat open to refinement in Unicode updates); and that that's the basis on which you hope to proceed.

That's a reasonable response, but you still can't have "all" even when restricted to "all non-degenerate" and "only ones that could appear in real life", because there's still an effectively infinite number of well formed and potentially reasonable characters.

I ought really insert some calculations here to put some bounds on the problem. Is "effectively infinite" a trillion? Why? That sort of thing. But before digging into that I'll await comments.

Let's pretend it's a trillion, and that that's not a problem, and move on.

Unicode

Unicode is enormously complex.

You've been given an assignment to produce a Caesar cipher, a very simple thing.

They really don't mix well unless you lean heavily on keeping things simple.

But you want to investigate properties of Unicode. So perhaps you want to wade into all the complexity. But then the question is, how many years do you want to spend exploring the consequences of opening this pandora's box? (I've been studying Unicode on and off for a decade. It's complicated.)

Printable

You linked to the SO question "What is the range of Unicode Printable Characters?". This includes an answer that notes:

The more you learn about Unicode, the more you realize how unexpectedly diverse and unfathomably weird human writing systems are. In particular whether a particular "character" is printable is not always obvious.

But you presumably read that and refused to be deterred. Which is both admirable and asking for trouble. For example, it seems to have driven you to define "printable" as something like "takes one or more keystrokes to traverse" which is so fraught it's hard to know where to start -- so I'll punt on that till later in this answer.

Characters

Given that your aim is to write a Caesar cipher, a cipher that was used thousands of years ago that acts on characters, it makes sense that you focused on "what a user thinks of as a character".

Per Unicode's definition, this is called a "grapheme".

One of your example characters makes it clear how problematic the distinction is between "what a user thinks of as a character" (a grapheme) and a codepoint (what Python thinks of as a character):

print('क्षि'[::-1])
िष्क

This shows mangling of a single "character" (a single grapheme) written in Devanagari, which is, according to Wikipedia, "one of the most used and adopted writing systems in the world".

(Or, if we want to ignore the half the planet this mangling ever more routinely affects and just focus on the folk who thought they were safe:

print(''[::-1])

That's a flag of one nation turning into another's. Fortunately flags rarely appear in text -- though that's changing now text is increasingly arbitrary Unicode text like this text I'm writing -- and flag characters are not that important and both Great Britain and Bulgaria are members of the EU so it's probably not nearly as bad as scrambling the text of a billion Indians.)

Graphemes

So you quite reasonably thought to yourself, "Maybe Perl 6 will help".

To quote UAX#29, the Unicode Annex document on "Unicode Text Segmentation":

This document defines a default specification for grapheme clusters.

Perl 6 has implemented a grapheme clustering mechanism. It could in principle cluster in a variety of ways but for now it's implemented the default specification. This is what allows Perl 6 to avoid the mistakes Python's making in the above.

But the Unicode document continues:

[the specification for grapheme clusters] may be customized for particular languages, operations, or other situations.

So you can't just eyeball some text (or give it to some software) and say what "characters" it contains if by "character" you mean "what a user thinks of as a character".

It gets worse...

Keystrokes

"‍‍‍ij क्षि " ... notepad ... 10 key strokes ... google search ... 7 strokes ... Perl6 ... Atom editor ... perl6 thinks क्षि ... is actually 2 graphemes ... I think this is a bug ... notepad (and google search) thinks it is a single grapheme/character

For me, google search needs 10 keystrokes -- because it's not to do with google search but instead aspects of my system, including which web browser I'm using (Firefox) and other details.

Some editors could be configurable so that cursoring over 'क्षि' (or 'fi') would be either 1 or 2 keystrokes depending on how you configure them and/or what language you specify the text is written in. For me, editing this SO answer using Firefox on Linux Mint, it takes 2 keystrokes to cursor over क्षि.

Perl 6 correctly reports the .chars result for 'क्षि' as 2 by default because that's what Unicode says it is per the default grapheme clustering specification. ("Extended Grapheme Clusters".) That happens to match what Firefox on Linux Mint does editing this SO answer because the stars line up and it's Sunday.

Notepad or other software reasonably takes just one keystroke to cursor over क्षि, while other editors reasonably take two, because both are reasonable per the Unicode specification:

arrow key movement ... could use knowledge specific to particular fonts to move in a more granular manner, in circumstances where it would be useful to edit individual components

My emphasis added. Unicode leaves it up to the software to decide how the cursor will move.

Your questions

1: Is the above a fair definition of what it means to be a printable character in unicode?

I don't think so. Hopefully the foregoing explains why, or at least points you in the directions you would need to research (for a year or three) to understand why.

2: ... do you know of any list of printable characters that cover the currently used unicode planes and possible grapheme clusters ...

There's such a vast number of "possible grapheme clusters" that can reasonably occur that even excluding degenerate codepoint combinations leaves you with an effectively infinite list.

And any small subset anyone may have created would not be canonical because the Unicode consortium would not bless it and folk would argue about what should be included.

3: ... how would you go about creating such a list with which I could create a caesar cipher that could cipher any/all printable characters given the following 4 conditions?

First, your conditions are far too strenuous. See the next section.

But even if you drop the ones that are too difficult, it's still far too difficult, and the outcome far too uninteresting, to make doing anything worthwhile.

4: If you think creating such a list is a terrible idea, how would you create the cipher?

If it were me and it had to be a Caesar cipher I'd make it just handle bytes, as per Tom's comment at the start of this answer.

Your conditions

The string to be encrypted will be a valid utf8 string consisting of standard unicode code points (no unassigned, or private use area codepoints)

It'll need to be more restricted than that, but it's reasonable to say it'll need to be a valid Unicode string. If you want to insist it's utf8 that's fine too.

The encrypted string must also be a valid utf8 string consisting of standard unicode code points

Sure.

You must be able to traverse the encrypted string using the same number of strokes ... as the original string

You can have that for a small subset of Unicode characters. But...

This means [one keystroke for both the original and encoded version of] the devanagari character [क्षि]

... is not a reasonable requirement.

You could ensure the same grapheme clustering (character) interpretation of a given text if you wrote a custom implementation of the grapheme clustering for your cipher that was a faithful copy of the implementation of grapheme clustering used to control the cursor.

But then you'd then have to maintain these two code bases to keep them in sync. And that would be for just one particular system configuration.

It would be a ridiculous amount of pain. And for zero, or at most minuscule, gain.

the string to be encrypted and the string that has been decrypted (the end result) must contain the exact same codepoints (the 2 strings must be equal).

So, no normalization.

That rules out all Perl 6's grapheme aware goodies.

Then again, you need to drop paying attention to graphemes anyway because there's effectively an infinite number of them.

Conclusion

My answer only touches lightly on the topics it covers and probably contains lots of errors. If you critique it I'll try improve it.

raiph
  • 31,607
  • 3
  • 62
  • 111
  • Thanks for your answer! (sorry for responding late. Busy week). I admit my "traverse" idea was bad. I need a better definition of what it means to be character. A god starting point is probably fonts. Take the google NOTO fonts. The aim is close to mine: identify all the characters that could be visible on screen and provide a font for them, so that people can avoid "Tofu".... – Arimaafan Jan 27 '19 at 20:28
  • ...You wrote: "...there's still an effectively infinite number of well formed and potentially reasonable characters." Fonts dont include things like whitespace, but other than that, how are projects like google NOTO incomplete? what do they leave out? Afaik they are, in principle, a finite set for a given unicode version. – Arimaafan Jan 27 '19 at 20:29
  • @Arimaafan NOTO isn't incomplete. It ensures that all graphemes display without tofu. All of them. I'm ignoring the issue of fonts and gyphs because A) you wrote "assume you have all necessary fonts to see those that should be visible on screen" and B) they're not a problem. – raiph Jan 27 '19 at 22:07
  • @Arimaafan I wrote "I ought really insert some calculations here to put some bounds on the problem. Is "effectively infinite" a trillion? ... But before digging into that I'll await comments." Now you've commented I'm willing to take a look at this. The first issue I'll look at next week is how many relatively generic combiners there are. `say "A keycap grapheme: a\c[Combining Diaeresis, Combining Enclosing Keycap]"` displays `A keycap grapheme: ä⃣`. The keycap combiner seems generic. If there were 10 such generic combiners a filled out alphabet table would contain about a billion graphemes. – raiph Jan 27 '19 at 23:37
  • @Arimaafan It seems like there are hundreds if not thousands of what could reasonably be viewed as "generic combiners". This of course means trillions of trillions of combinations, even after normalization. Resources I've read as I explored show many other intractable problems if one were to attempt to create a cipher remotely like the one you describe. So I'm going to bail at this point. In the next comment I'll leave links to the Unicode site and SO questions that led me to this conclusion. – raiph Feb 04 '19 at 15:52
  • Actually just these two should do. First, [the Unicode standard's Character Properties chapter](https://www.unicode.org/versions/Unicode11.0.0/ch04.pdf). And [one of several SOs that have links worth following](https://stackoverflow.com/questions/21722729/what-is-the-difference-between-combining-characters-and-grapheme-extenders-i). – raiph Feb 04 '19 at 15:58
  • I have never heard of combiners before, but based on what you wrote, I dont see a problem with them. If you have a way to tell which characters can be "combined" with a combiner, then you can just calculate which index number a character + combiner would have in a list of all characters. Effectively, you just have a kind of lazy list from which you COULD create a trillions long actual list if you wanted. – Arimaafan Feb 04 '19 at 16:21
  • Combining characters like the keycap in my comment above or ideographs cause combinatorial explosions in the number of legitimate characters. Unicode's rules determine how they combine. Aiui, before you can run a Caesar cipher you'd need to go ahead and generate *all* these characters so you can remove ones that are degenerate, unprintable, not-byte-wise-reversible, etc. from the characters that will then participate in the alphabet subset that will be the Caesar cipher's alphabet. How can you do that lazily? Surely you have to build the whole alphabet before you can cipher a single character? – raiph Feb 04 '19 at 22:26
  • @Arimaafan And I guess I haven't been explicit that these characteristics (degenerate, unprintable, not-byte-wise-reversible, etc.) can't be determined by computation. They require a human to make the decision. Degeneracy is hopefully obvious; the whole point of the concept is that a degenerate character is technically valid but humans don't use that character. I think you think you can deduce "printable" by leaning on noto and you may be right though I doubt it. Byte-wise-reversible would require near infinite iterations because it might reverse into a technically valid degenerate. Etc. – raiph Feb 09 '19 at 13:15