
Is there a more or less standard way to transliterate the Polish alphabet using only the original ASCII (US-ASCII) characters?

This question can be broken down into two related, more precise questions:

  1. How can the 32 letters of the Polish alphabet be transliterated with only the 26 letters of the basic Latin alphabet in a way that maximizes understanding by a Polish reader?
  2. Is there a reversible way to transliterate any Polish text with US-ASCII characters?

I can see that most Polish websites just remove the diacritics in their URLs. For example:

Świętosław Milczący    →  Swietoslaw Milczacy
Dzierżykraj Łaźniński  →  Dzierzykraj Lazninski
Józef Soćko            →  Jozef Socko

This is hardly reversible, but is it the most readable transliteration for Polish readers?

In some other cases, a more complicated ad hoc transliteration might be used, like Wałęsa → Wawensa. Are there any standard rules for this latter kind of transformation?

P.S. Just to clarify, I'm interested in transliteration rules (like ł → w, ę → en), not the implementation. Something like this table.

Andriy Makukha

2 Answers


Ad. 1. The Polish alphabet consists of only two groups of letters: plain Latin letters and Latin letters with diacritics. Therefore the only way in actual use to transliterate Polish letters is to remove the diacritics from the second group, for example:

ą --> a
ć --> c
ż --> z
ź --> z
...

This is the most readable transliteration.
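
To make the mapping above concrete, here is a minimal Python sketch of plain diacritic removal. The names `STRIP_TABLE` and `strip_polish_diacritics` are made up for this illustration. An explicit table is used because `ł` has no Unicode decomposition, so an approach based on stripping combining marks after normalization would leave it untouched.

```python
# Hypothetical helper: map each Polish letter with a diacritic to its base letter.
POLISH = "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ"
PLAIN = "acelnoszzACELNOSZZ"
STRIP_TABLE = str.maketrans(POLISH, PLAIN)


def strip_polish_diacritics(text: str) -> str:
    """Return `text` with Polish diacritics removed (lossy: ź and ż both become z)."""
    return text.translate(STRIP_TABLE)


print(strip_polish_diacritics("Świętosław Milczący"))    # Swietoslaw Milczacy
print(strip_polish_diacritics("Dzierżykraj Łaźniński"))  # Dzierzykraj Lazninski
```

Note that `ź` and `ż` both map to `z`, which is one reason this form cannot be reversed without extra information.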

Ad. 2. Definitely not.

Jerzy D.
  • Thanks! There is a fully reversible way to transliterate Ukrainian with US-ASCII, so I'm pretty sure it's possible for Polish as well. But still thanks for your feedback. – Andriy Makukha Aug 02 '18 at 10:45

You could encode the presence of diacritics as a ternary number and store it next to the plain ASCII transliteration to make it reversible.

URLs often contain an additional ID anyway, even the one for this question: 48686148/how-to-transliterate-polish-alphabet-with-us-ascii

Here is an example implementation:

trans_table = {
    'A': ('A', 0),   'a': ('a', 0),
    'Ą': ('A', 1),   'ą': ('a', 1),
    'B': ('B', 0),   'b': ('b', 0),
    'C': ('C', 0),   'c': ('c', 0),
    'Ć': ('C', 1),   'ć': ('c', 1),
    'D': ('D', 0),   'd': ('d', 0),
    'E': ('E', 0),   'e': ('e', 0),
    'Ę': ('E', 1),   'ę': ('e', 1),
    'F': ('F', 0),   'f': ('f', 0),
    'G': ('G', 0),   'g': ('g', 0),
    'H': ('H', 0),   'h': ('h', 0),
    'I': ('I', 0),   'i': ('i', 0),
    'J': ('J', 0),   'j': ('j', 0),
    'K': ('K', 0),   'k': ('k', 0),
    'L': ('L', 0),   'l': ('l', 0),
    'Ł': ('L', 1),   'ł': ('l', 1),
    'M': ('M', 0),   'm': ('m', 0),
    'N': ('N', 0),   'n': ('n', 0),
    'Ń': ('N', 1),   'ń': ('n', 1),
    'O': ('O', 0),   'o': ('o', 0),
    'Ó': ('O', 1),   'ó': ('o', 1),
    'P': ('P', 0),   'p': ('p', 0),
    'R': ('R', 0),   'r': ('r', 0),
    'S': ('S', 0),   's': ('s', 0),
    'Ś': ('S', 1),   'ś': ('s', 1),
    'T': ('T', 0),   't': ('t', 0),
    'U': ('U', 0),   'u': ('u', 0),
    'W': ('W', 0),   'w': ('w', 0),
    'Y': ('Y', 0),   'y': ('y', 0),
    'Z': ('Z', 0),   'z': ('z', 0),
    'Ź': ('Z', 1),   'ź': ('z', 1),
    'Ż': ('Z', 2),   'ż': ('z', 2),
}



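def pol2ascii(text):
    """Return the ASCII form of `text` plus a hex suffix encoding its diacritics."""
    plain = []
    diacritics = []
    for c in text:
        # Characters outside the table pass through unchanged with digit 0.
        ascii_char, diacritic = trans_table.get(c, (c, 0))
        plain.append(ascii_char)
        diacritics.append(str(diacritic))

    # Read the digits as one base-3 number (first character least significant);
    # the leading '1' keeps the zero digits of trailing characters from being lost.
    suffix = hex(int('1' + ''.join(reversed(diacritics)), 3))[2:]
    return ''.join(plain) + '_' + suffix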

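# Inverse lookup: (ASCII letter, ternary digit) -> Polish letter.
reverse_trans_table = {
    ascii_pair: pol_char for pol_char, ascii_pair in trans_table.items()
}

def ascii2pol(text):
    """Rebuild the Polish text from the output of pol2ascii."""
    plain, diacritics = text.rsplit('_', 1)
    diacritics = int(diacritics, base=16)
    res = []

    for c in plain:
        # Peel off one ternary digit per character, least significant first.
        diacritic = diacritics % 3
        diacritics = diacritics // 3
        pol_char = reverse_trans_table.get((c, diacritic), c)
        res.append(pol_char)

    return ''.join(res)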


TESTS = '''
Świętosław Milczący
Dzierżykraj Łaźniński
Józef Soćko
'''

for line in TESTS.strip().splitlines():
    plain = pol2ascii(line)
    original = ascii2pol(plain)
    print(original, plain)
    assert original == line
Bunyk
  • I love the ternary idea! To make the numeric value even smaller, we can skip the digits for characters that never take diacritics. For example `mówi` -> `mowi_4` (ternary 11), `mowa` -> `mowa_9` (or just `mowa`; ternary 100). My goal, though, was to avoid using any additional numeric value. – Andriy Makukha Jun 18 '19 at 20:45
  • @AndriyMakukha Yes, and additionally we could use binary, encoding a plain Z as 0 and a Z with a diacritic as 1, then telling the two diacritics apart with the next bit. The additional text in the URL then decreases from 50% to 20%: https://gist.github.com/bunyk/688a457acfc24f682d8bc2ef1a00d693 It could be even more compact if we used URL-safe Base64 encoding instead of hex. – Bunyk Jun 19 '19 at 09:49
  • I would use decimal values instead, since Base64 can result in the appearance of swear words in the URL. Even hexadecimal can result in unwanted words like "dead" ;) I don't want my URL to look something like: `Dziwny-wpis-Walesy-Udostepnil-skandaliczna-grafike-z-Janem-Pawlem-II_DEAD` ;) – Andriy Makukha Jun 19 '19 at 10:27
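
The compaction suggested in the comments can be sketched roughly as follows. This is not the code from the linked gist, only an illustration under two assumptions: ternary digits are stored only for letters that can carry a diacritic (a, c, e, l, n, o, s, z), and the suffix is written in decimal, as the last comment prefers. The names `MARKED`, `BASES`, `RESTORE`, `encode` and `decode` are invented for this sketch.

```python
# Polish letters with diacritics -> (base letter, ternary digit).
MARKED = {
    'ą': ('a', 1), 'ć': ('c', 1), 'ę': ('e', 1), 'ł': ('l', 1),
    'ń': ('n', 1), 'ó': ('o', 1), 'ś': ('s', 1), 'ź': ('z', 1),
    'ż': ('z', 2),
}
# Add the uppercase variants.
MARKED.update({ch.upper(): (base.upper(), digit)
               for ch, (base, digit) in list(MARKED.items())})

# Base letters that may carry a diacritic; only these consume a digit.
BASES = {base for base, _ in MARKED.values()}
# Inverse lookup: (base letter, digit) -> Polish letter.
RESTORE = {pair: ch for ch, pair in MARKED.items()}


def encode(text: str) -> str:
    plain, digits = [], []
    for ch in text:
        base, digit = MARKED.get(ch, (ch, 0))
        plain.append(base)
        if base in BASES:
            digits.append(str(digit))
    # Leading '1' is a sentinel so that zero digits are not lost;
    # the first letter ends up as the least significant digit.
    value = int('1' + ''.join(reversed(digits)), 3)
    return ''.join(plain) + '_' + str(value)  # decimal suffix (assumption)


def decode(encoded: str) -> str:
    plain, _, suffix = encoded.rpartition('_')
    value = int(suffix)
    out = []
    for ch in plain:
        if ch in BASES:
            value, digit = divmod(value, 3)
            out.append(RESTORE.get((ch, digit), ch))
        else:
            out.append(ch)
    return ''.join(out)


print(encode('mówi'))  # mowi_4
print(encode('mowa'))  # mowa_9
```

Decoding reverses the process, so `decode(encode(s)) == s` holds for the three test names used in the answer.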