
I am trying to modify csvsort.py (csvkit, https://csvkit.readthedocs.org/en/0.9.0/) so that it handles diacritics correctly.

I have found this code (http://www.gossamer-threads.com/lists/python/python/1030549) that works perfectly for sorting a list:

alphabet = (
u' ', u'.', u'\'', u'-', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9', u'a', u'A', u'ä', u'Ä', u'á', u'Á', u'â', u'Â',
u'à', u'À', u'å', u'Å', u'b', u'B', u'c', u'C', u'ç', u'Ç', u'd', u'D', u'e', u'E', u'ë', u'Ë', u'é', u'É', u'ê', u'Ê', u'è', u'È',
u'f', u'F', u'g', u'G', u'h', u'H', u'i', u'I', u'ï', u'Ï', u'í', u'Í', u'î', u'Î', u'ì', u'Ì', u'j', u'J', u'k', u'K', u'l', u'L',
u'm', u'M', u'n', u'ñ', u'N', u'Ñ', u'o', u'O', u'ö', u'Ö', u'ó', u'Ó', u'ô', u'Ô', u'ò', u'Ò', u'ø', u'Ø', u'p', u'P', u'q', u'Q',
u'r', u'R', u's', u'S', u't', u'T', u'u', u'U', u'ü', u'Ü', u'ú', u'Ú', u'û', u'Û', u'ù', u'Ù', u'v', u'V', u'w', u'W', u'x', u'X',
u'y', u'Y', u'z', u'Z'
) 

hashindex = {character:index for index, character in enumerate(alphabet)}
def string2sortlist(string):
    return [hashindex[s] for s in string]


import random
things_to_sort = ["".join(random.sample(alphabet, random.randint(4, 6)))
                  for _ in range(200000)]

print(things_to_sort[:15])

things_to_sort.sort(key=string2sortlist)

print(things_to_sort[:15])

So the question is:

How should I modify

sorter = lambda r: [(r[c] is not None, r[c]) for c in column_ids]
rows.sort(key=sorter, reverse=self.args.reverse)

from csvsort.py so that it uses string2sortlist() (and its hashindex lookup) from the working code above?

TIA.

Miguel.

rbenit68

1 Answer


The key argument to the sort call is a function that maps each item to the value used for comparison, and so determines the order of the result.

Apparently csvkit builds each key from a tuple of a Boolean and then the value. Since False sorts before True, None values group together at the beginning in ascending order (and at the end when reverse is set).
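A minimal sketch of that tuple trick, using made-up values rather than csvkit's own data. The list `values` is hypothetical; None stands for an empty cell.

```python
# Hypothetical sample values; None stands for an empty cell.
values = [u'b', None, u'a', None, u'c']

# Key is (is not None, value): False sorts before True, so None entries
# come first, and None is never ordered against a string.
none_first = sorted(values, key=lambda v: (v is not None, v))
print(none_first)
```

The second element of the tuple only decides the order between entries that share the same flag.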

Note that the sorter function is called once per row, but it always builds its key from the same column_ids, so every row is compared on the same columns in the same order.

So you should define your hash and sorting functions, and then change only the sorter function to:

sorter = lambda r: [(r[c] is not None, string2sortlist(r[c])) for c in column_ids]

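The Boolean still marks empty (None) values as before. Note that string2sortlist(None) would raise a TypeError, so guard against None if your data contains empty cells. For all other values, the string is replaced with the list of alphabet indices of its characters, which gives you the ordering you wanted.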

The key is still built from the same columns as before, so your sorting stays consistent across rows.
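Putting it together, here is a rough sketch of the modified sorter run against stand-in data. It assumes the sorted columns hold text or None; `rows` and `column_ids` below are hypothetical substitutes for the objects csvsort.py builds, the alphabet is shortened for illustration, and the None guard in string2sortlist is my addition, not part of the original snippet.

```python
# -*- coding: utf-8 -*-
# Shortened alphabet for illustration; the full tuple from the question
# works the same way.
alphabet = (u'a', u'A', u'á', u'Á', u'b', u'B',
            u'e', u'E', u'é', u'É', u'z', u'Z')
hashindex = {character: index for index, character in enumerate(alphabet)}

def string2sortlist(string):
    # Assumption: an empty cell (None) maps to an empty list so that the
    # key can still be built; the original function has no such guard.
    if string is None:
        return []
    # Raises KeyError for any character missing from the alphabet.
    return [hashindex[s] for s in string]

# Hypothetical stand-ins for csvkit's rows and column_ids.
rows = [[u'zeb', 1], [u'éa', 2], [None, 3], [u'ab', 4], [u'áb', 5], [u'ea', 6]]
column_ids = [0]

sorter = lambda r: [(r[c] is not None, string2sortlist(r[c])) for c in column_ids]
rows.sort(key=sorter)

sorted_names = [r[0] for r in rows]
print(sorted_names)
```

In csvsort.py itself the sort call would keep its reverse=self.args.reverse argument; it is left out here only to keep the sketch short.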

Cimbali
  • Thank you; I was very very near. Now I can use other convenient functions instead of string2sortlist(), e.g., strip_accents() (http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string/518232#518232) – rbenit68 Dec 30 '14 at 12:40
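For the strip_accents() variant mentioned in the comment, a possible sketch based on the standard unicodedata module follows. The `rows` and `column_ids` values are again hypothetical stand-ins, and the None guard is an assumption about how empty cells should be handled.

```python
# -*- coding: utf-8 -*-
import unicodedata

def strip_accents(s):
    # Decompose characters (NFD) and drop combining marks (category 'Mn').
    return u''.join(ch for ch in unicodedata.normalize('NFD', s)
                    if unicodedata.category(ch) != 'Mn')

# Hypothetical stand-ins for csvkit's rows and column_ids.
rows = [[u'zeta'], [u'éclair'], [None], [u'eagle'], [u'água']]
column_ids = [0]

# Empty cells (None) get an empty string as their comparison value.
sorter = lambda r: [(r[c] is not None,
                     strip_accents(r[c]) if r[c] is not None else u'')
                    for c in column_ids]
rows.sort(key=sorter)

accent_sorted = [r[0] for r in rows]
print(accent_sorted)
```

Unlike the alphabet table, this treats an accented letter as equal to its base letter, and characters with no decomposition (such as ø) pass through unchanged.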