
I know this has been asked before, but I have not been able to find a solution.

I'm trying to alphabetize a list of lists according to a custom alphabet.

The alphabet is a representation of the Burmese script as used by Sgaw Karen in plain ASCII. The Burmese script is an alphasyllabary—a few dozen onsets, a handful of medial diacritics, and a few dozen rhymes that can be combined in thousands of different ways, each of which is a single "character" representing one syllable. The map.txt file has these syllables, listed in (Karen/Burmese) alphabetical order, but converted in some unknown way into ASCII symbols, so the first character is u>m;.Rf rather than က or [ka̰]. For example:

u>m;.Rf ug>m;.Rf uH>m;.Rf uX>m;.Rf uk>m;.Rf ul>m;.Rf uh>m;.Rf uJ>m;.Rf ud>m;.Rf uD>m;.Rf u->m;.Rf uj>m;.Rf us>m;.Rf uV>m;.Rf uG>m;.Rf uU>m;.Rf uS>m;.Rf u+>m;.Rf uO>m;.Rf uF>m;.Rf
c>m;.Rf cg>m;.Rf cH>m;.Rf cX>m;.Rf ck>m;.Rf cl>m;.Rf ch>m;.Rf cJ>m;.Rf cd>m;.Rf cD>m;.Rf c->m;.Rf cj>m;.Rf cs>m;.Rf cV>m;.Rf cG>m;.Rf cU>m;.Rf cS>m;.Rf c+>m;.Rf cO>m;.Rf cF>m;.Rf

Each list in the list of lists has, as its first element, a word of Sgaw Karen converted into ASCII symbols in the same way. For example:

[['u&X>', 'n', 'yard'], ['vk.', 'n', 'yarn'], ['w>ouDxD.', 'n', 'yawn'], ['w>wuDxD.', 'n', 'yawn']]

This is what I have so far:

def alphabetize(word_list):
    alphabet = ''.join([line.rstrip() for line in open('map.txt', 'rb')])
    word_list = sorted(word_list, key=lambda word: [alphabet.index(c) for c in word[0]])
    return word_list

I would like to alphabetize word_list by the first element of each list (e.g. 'u&X>', 'vk.'), according to the order given in alphabet.

My code's not working yet and I'm struggling to understand the sorted command with lambda and the for loop.

abarnert
denvaar
  • What do the patterns in `map.txt` mean? What does the file look like? – Reut Sharabani Dec 10 '14 at 22:43
  • In what way is it not working yet? What's in `alphabet`, and which values does it sort wrong? – abarnert Dec 10 '14 at 22:43
  • Also, using the parameter name `word` when the argument is going to be a _list_ of words seems like a pretty confusing thing to do, and might be part of the reason you're struggling to understand your code. It also might help to turn the `lambda` into an out-of-line `def`, so you can call it manually on different values and see what it returns (and just so you don't have everything packed into one huge expression that runs off the edge of the screen; you can expand the listcomp into a `for` loop, give things temporary names, etc. if it helps). – abarnert Dec 10 '14 at 22:45
  • `u&X>` is not in your alphabet string, nor is `vk.` or `w>ouDxD.`... how can you index them when they don't exist? – Padraic Cunningham Dec 10 '14 at 22:49
  • I didn't post the entire sequence because it's really long. – denvaar Dec 10 '14 at 22:52
  • @PadraicCunningham: Despite the misleading variable names, he's actually looking for each _character_ of the first word in each list, not each word. – abarnert Dec 10 '14 at 22:53
  • well if they do exist `word_list = sorted(word_list, key=lambda sub: alphabet.index(sub[0]))` will do what you want – Padraic Cunningham Dec 10 '14 at 22:53
  • That being said, `'D'` isn't anywhere in your "alphabet", so it's still going to fail… – abarnert Dec 10 '14 at 22:53
  • @abarnert, apparently the alphabet is much larger, I presumed the sublists were sorted by the index of the first word of each sublist but maybe not. – Padraic Cunningham Dec 10 '14 at 22:54
  • possible duplicate of [How to sort a list of lists by a specific index of the inner list?](http://stackoverflow.com/questions/4174941/how-to-sort-a-list-of-lists-by-a-specific-index-of-the-inner-list) – smac89 Dec 11 '14 at 00:45
  • I've edited the question to match what I believe you're asking based on the comments. Please review it—and if I'm wrong, please correct it and make it unambiguous. – abarnert Dec 11 '14 at 00:53
  • @Smac89: Unfortunately, that's the easy part of this question, and he's already done that part. His problem is that he can't just use the ASCIIbetical or Unicode order of element 0, he has to look up each character in a map—and those "characters" are apparently anywhere from 3 to 8 ASCII characters in width. So, it's not a dup. – abarnert Dec 11 '14 at 00:55
  • You might have a look at [Pyuca](https://github.com/jtauber/pyuca) that performs a similar form of custom sorting but using Unicode. The method is the same. – dawg Dec 11 '14 at 00:58
  • @dawg: You don't really need that unless you've got multiple locales in the same data, or one of the other edge cases that the older ISO 14651 algorithm used by the built-in `locale.strcoll` and friends (at least on some platforms/versions) handles differently. See [the FAQ](http://unicode.org/faq/collation.html) for the differences. I haven't actually tested with Burmese script in a non-Burmese locale, so I could be wrong; maybe you do need it. (I'd probably use PyICU over PyUCA, given that it's seen a lot heavier use, but PyUCA does seem a little easier to use, and is easier to install if you don't have ICU.) – abarnert Dec 11 '14 at 02:58
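
abarnert's suggestion in the comments above, turning the `lambda` into an out-of-line `def` so it can be called by hand, might look roughly like the sketch below. The reversed English alphabet and the sample entries are hypothetical stand-ins, not the real contents of `map.txt`; the point is only to show what the key function returns for one entry.

```python
# Hypothetical stand-in for the real alphabet string built from map.txt:
# the English alphabet in reverse, so 'z' sorts first and 'a' sorts last.
ALPHABET = "zyxwvutsrqponmlkjihgfedcba"

def sort_key(entry):
    """Return the alphabet position of each character in entry[0]."""
    headword = entry[0]
    positions = []
    for ch in headword:
        # str.index raises ValueError if ch is not in ALPHABET at all.
        positions.append(ALPHABET.index(ch))
    return positions

# Made-up entries in the same [headword, part of speech, gloss] shape.
entries = [
    ["cab", "n", "taxi"],
    ["abc", "n", "letters"],
    ["cba", "n", "reversed"],
]

# Calling the key function by hand shows exactly what sorted() compares.
print(sort_key(entries[0]))  # [23, 25, 24]

ordered = sorted(entries, key=sort_key)
print([entry[0] for entry in ordered])  # ['cba', 'cab', 'abc']
```

Because `sorted` compares the returned lists element by element, this only gives a meaningful order when every single ASCII character has one fixed position in the alphabet, which is the assumption the later comments call into question.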

1 Answer


First, if you're trying to look up the entire word[0] in alphabet, rather than each character individually, you shouldn't be looping over the characters of word[0]. Just use alphabet.index(word[0]) directly.

From your comments, it sounds like you're trying to look up each transliterated-Burmese-script character in word[0]. That isn't possible unless you can write an algorithm to split a word up into those characters. Splitting it up into the ASCII bytes of the transliteration doesn't help at all.


Second, you probably shouldn't be using index here. When you think you need to use index or similar functions, 90% of the time, that means you're using the wrong data structure. What you want here is a mapping (presumably why it's called map.txt), like a dict, keyed by words, not a list of words that you have to keep explicitly searching. Then, looking up a word in that dictionary is trivial. (It's also a whole lot more efficient, but the fact that it's easy to read and understand can be even more important.)


Finally, I suspect that your map.txt is supposed to be read as a whitespace-separated list of transliterated characters, and what you want to find is the index into that list for any given word.


So, putting it all together, something like this:

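# Map each whitespace-separated syllable in map.txt to its position in the file.
with open('map.txt') as f:
    mapping = {syllable: index for index, syllable in enumerate(f.read().split())}
# Sort entries by the position of their first element (one-syllable words only).
word_list = sorted(word_list, key=lambda entry: mapping[entry[0]])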

But, again, that's only going to work for one-syllable words, because until you can figure out how to split a word up into the units that should be alphabetized (in this case, the symbols), there is no way to make it work for multi-syllable words.
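
For illustration only, here is a minimal sketch of what such a splitting step could look like, assuming (and this is unverified for this transliteration) that a greedy longest-match against the syllables in map.txt splits every word unambiguously. The toy syllabary, the sample entries, and the helper names `build_mapping`, `split_syllables`, and `sort_key` are all hypothetical.

```python
def build_mapping(syllables):
    """Map each syllable to its rank in the custom alphabetical order."""
    return {syllable: rank for rank, syllable in enumerate(syllables)}

def split_syllables(word, mapping):
    """Split word into known syllables using greedy longest-match."""
    longest = max(len(syllable) for syllable in mapping)
    pieces = []
    pos = 0
    while pos < len(word):
        # Try the longest possible candidate first, then shorter ones.
        for size in range(min(longest, len(word) - pos), 0, -1):
            candidate = word[pos:pos + size]
            if candidate in mapping:
                pieces.append(candidate)
                pos += size
                break
        else:
            raise ValueError("no known syllable at position %d of %r" % (pos, word))
    return pieces

def sort_key(entry, mapping):
    """Sort key: the rank of each syllable in the entry's headword."""
    return [mapping[s] for s in split_syllables(entry[0], mapping)]

# Hypothetical toy syllabary, listed in its own "alphabetical" order.
syllables = ["ka", "kag", "ki", "ta", "tag", "ti"]
mapping = build_mapping(syllables)

# Made-up entries in the same [headword, part of speech, gloss] shape.
entries = [["tika", "n", "x"], ["kagti", "n", "y"], ["kata", "n", "z"], ["ta", "n", "w"]]
ordered = sorted(entries, key=lambda entry: sort_key(entry, mapping))
print([entry[0] for entry in ordered])  # ['kata', 'kagti', 'ta', 'tika']
```

Greedy matching can pick a syllable that leaves an unsplittable remainder even when another split exists, so a real implementation might need backtracking or actual knowledge of the transliteration's rules.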

And once you've written the code that does that, I'll bet it would be pretty easy to just convert everything to proper Unicode representations of the Burmese script. Each syllable still takes 1-4 code points in Unicode, but that's fine, because Unicode-aware collation should already know how to alphabetize that script, so you don't have to write the ordering yourself. Python's built-in locale.strcoll uses whatever collation the platform provides, and a Unicode Collation Algorithm library such as PyICU or PyUCA is an alternative where that falls short.

Or, even better, unless this is some weird transliteration that you or your teacher invented, there's probably already code to translate between this format and Unicode, which means you shouldn't even have to write anything yourself.

abarnert
  • Thanks for the comments. Let me try to be more clear about what the map.txt file is. It's not necessarily a list of words. The white space is irrelevant. It's more like a big long string that would function the same as 'abcdefghijklmnopqrstuvwxyz', defining which characters come before others in this language. – denvaar Dec 10 '14 at 23:15
  • @DenverSmith: So by "characters" you really mean characters—`u`, then `>`, then `m`, etc.? But most of those characters occur over and over and over, so what does it mean to "come before"? – abarnert Dec 10 '14 at 23:34
  • @DenverSmith: If you're trying to infer a character order from an alphabetized list of words, that doesn't work. For example, the standard English dictionary that comes with most Unix systems starts with `a aa aal aalii aam`, so it's going to tell you that `l` is the second letter in the alphabet, and `b` is the 10th. – abarnert Dec 10 '14 at 23:36
  • So in this language (Sgaw Karen) each character is made up of at least a consonant and a vowel, and possibly a tone. The text in the map file begins with the first consonant and then each vowel and then each tone and then moves to the second consonant and so forth. Maybe there is a better way to store/organize this data, it doesn't seem like it should be so complex. Hope that makes sense. Thanks. – denvaar Dec 11 '14 at 00:02
  • @DenverSmith: First, it would really help to put enough information _into the question_ instead of making us dig it out of you, and then making anyone else who wants to help read dozens of comments. Second, that _still_ doesn't answer anything. If the alphabet is a syllabary (like Japanese kana), and `map.txt` is those syllables transliterated into one "unit" apiece separated by whitespace… then my code does exactly what you want. If, on the other hand, it's actually a word list (whether in alphabetical order or not), then you have the same problem as in my last comment. – abarnert Dec 11 '14 at 00:11
  • @DenverSmith: Actually, there's a _second_ problem—and this one is insurmountable. Words in Karen can be multiple syllables. Which means if you just have a table of transliterated syllables, there's no way to look up a word's alphabetical order except to break it into separate characters (as in Burmese-script alphasyllabic characters, not the bytes of ASCII transliteration). What you're trying isn't just not doable in that case, it doesn't even make sense. The obvious solution is to decode your transliteration into Unicode Burmese script and let the standard collation algorithm just work. – abarnert Dec 11 '14 at 00:36
  • This isn't an assignment, I'm just interested in figuring this out, so I'm just trying to figure out the best way to go about it. Thanks for your comments. I will look more into the things you've presented. – denvaar Dec 11 '14 at 02:15
  • @DenverSmith: I wasn't assuming this was an assignment, but, having never seen a transliteration that looks like this, I was assuming that maybe you're, say, a linguistics student who has to process a bunch of data in a format that some professor at his school invented that nobody else uses, in which case there probably wouldn't be any pre-existing library to convert back and forth (or, if there were, you'd know because that prof would tell you…). – abarnert Dec 11 '14 at 02:59
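
abarnert's point in the comments above about inferring a character order from an alphabetized word list can be checked with a short sketch. The helper name `first_appearance_order` is hypothetical, and the word list is just the five dictionary words quoted in that comment.

```python
def first_appearance_order(words):
    """List characters in the order they are first seen across words."""
    seen = []
    for word in words:
        for ch in word:
            if ch not in seen:
                seen.append(ch)
    return seen

# The first few words of the Unix dictionary, as quoted in the comment.
words = ["a", "aa", "aal", "aalii", "aam"]
order = first_appearance_order(words)
print(order)  # ['a', 'l', 'i', 'm']
```

Even on this tiny sample, `l` comes out as the second "letter", which is why a string built by joining the lines of a sorted list cannot be used directly as a per-character alphabet.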