Radix Sort for Strings in Python

Question

My radix sort function outputs sorted but wrong list when compared to Python's sort:

My radix sort: ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
Python's sort: ['a', 'aa', 'ab', 'abid', 'abs', 'asd', 'avc', 'axy']

* My radix sort does not do padding
* Its mechanism is least significant bit (LSB)
* I need to utilise the length of each word

The following is my code.

def count_sort_letters(array, size, col, base):
    output   = [0] * size
    count    = [0] * base
    min_base = ord('a')

    for item in array:
        correct_index = min(len(item) - 1, col)
        letter = ord(item[-(correct_index + 1)]) - min_base
        count[letter] += 1

    for i in range(base - 1):
        count[i + 1] += count[i]

    for i in range(size - 1, -1, -1):
        item = array[i]
        correct_index = min(len(item) - 1, col)
        letter = ord(item[-(correct_index + 1)]) - min_base
        output[count[letter] - 1] = item
        count[letter] -= 1

    return output


def radix_sort_letters(array):
    size = len(array)

    max_col = len(max(array, key = len))

    for col in range(max_col):
        array = count_sort_letters(array, size, col, 26)

    return array

Can anyone find a way to solve this problem?

Radix LSB **needs** padding, unfortunately. Maybe you could pad with `@` which is one place before `A` in the ASCII Table, then remove all the `@`s when you output, although this does feel a bit hacky. — AdrienW, Apr 01 '20 at 11:46
In your code the lines `correct_index = min(len(item) - 1, col); letter = ord(item[-(correct_index + 1)]) - min_base`, always uses the first letter of the word once col is greater than the word length. This causes shorter words to be sorted based upon their first letter once col is greater than the word length. For instance `['aa', 'a']` remains unchanged since on the for col loop we compare the 'a' in both words, which keeps the results unchanged. Not sure how to fix your routine but take a look at other [string radix sort](https://codehost.wordpress.com/2011/07/22/radix-sort/) — DarrylG, Apr 01 '20 at 11:47

DarrylG · Answer 1 · 2020-04-02T15:47:47.490

As I mentioned in my comments:

In your code the lines:

correct_index = min(len(item) - 1, col)
letter = ord(item[-(correct_index + 1)]) - min_base

Always uses the first letter of the word once col is greater than the word length. This causes shorter words to be sorted based upon their first letter once col is greater than the word length. For instance ['aa', 'a'] remains unchanged since on the for col loop we compare the 'a' in both words, which keeps the results unchanged.

Code Correction

Note: Attempted to minimize changes to your original code

def count_sort_letters(array, size, col, base, max_len):
  """ Helper routine for performing a count sort based upon column col """
  output   = [0] * size
  count    = [0] * (base + 1) # One addition cell to account for dummy letter
  min_base = ord('a') - 1 # subtract one too allow for dummy character

  for item in array: # generate Counts
    # get column letter if within string, else use dummy position of 0
    letter = ord(item[col]) - min_base if col < len(item) else 0
    count[letter] += 1

  for i in range(len(count)-1):   # Accumulate counts
      count[i + 1] += count[i]

  for item in reversed(array):
    # Get index of current letter of item at index col in count array
    letter = ord(item[col]) - min_base if col < len(item) else 0
    output[count[letter] - 1] = item
    count[letter] -= 1

  return output

def radix_sort_letters(array, max_col = None):
  """ Main sorting routine """
  if not max_col:
    max_col = len(max(array, key = len)) # edit to max length

  for col in range(max_col-1, -1, -1): # max_len-1, max_len-2, ...0
    array = count_sort_letters(array, len(array), col, 26, max_col)

  return array

lst = ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
print(radix_sort_letters(lst))

Test

lst = ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
print(radix_sort_letters(lst))

# Compare to Python sort
print(radix_sort_letters(lst)==sorted(lst))

Output

['a', 'aa', 'ab', 'abid', 'abs', 'asd', 'avc', 'axy']
True

Explanation

Counting Sort is a stable sort meaning:

Let's walk through an example of how the function works.

Let's sort: ['ac', 'xb', 'ab']

We walk through each character of each list in reverse order.

Iteration 0:

Key is last character in list (i.e. index -1):       
keys are ['c','b', 'b'] (last characters of 'ac', 'xb', and 'ab'

Peforming a counting sort on these keys we get ['b', 'b', 'c']

This causes the corresponding words for these keys to be placed in    
the order:    ['xb', 'ab', 'ac']

Entries 'xb' and 'ab' have equal keys (value 'b') so they maintain their 
order of 'xb' followed by 'ab' of the original list 
(since counting sort is a stable sort)

Iteration 1:

Key is next to last character (i.e. index -2):

Keys are ['x', 'a', 'a'] (corresponding to list ['xb', 'ab', 'ac'])

Counting Sort produces the order ['a', 'a', 'a']
which causes the corresponding words to be placed in the order
['ab', 'ac', 'xb'] and we are done.

Original Software Error--your code originally went left to right through the strings rather than right to left. We need to go right to left since we want to sort our last sort to be based upon the first character, the next to last to be based upon the 2nd character, etc.

Different Length Strings-the example above was with equal-length strings.

The previous example was simplified assuming equal length strings. Now let's try unequal lengths strings such as:

['ac', 'a', 'ab']

This immediately presents a problem since the words don't have equal lengths we can't choose a letter each time.

We can fix by padding each word with a dummy character such as '*' to get:

['ac', 'a*', 'ab']

Iteration 0: keys are last character in each word, so: ['c', '*', 'b']

The understanding is that the dummy character is less than all other
characters, so the sort order will be:
['*', 'b', 'c'] causing the related words to be sorted in the order

['a*', 'ab', 'ac']

Iteration 1: keys are next to last character in each word so: ['a', 'a', 'a']

 Since the keys are all equal counting sort won't change the order so we keep

  ['a*', 'ab', 'ac']

Removing the dummy character from each string (if any) we end up with:

    ['a', 'ab', 'ac']
The idea behind get_index is to mimic the behavior of padding strings without actual padding (i.e. padding is extra work). Thus, based upon the index it evaluates if the index points to the padded or unpadded portion of the string and returns an appropriate index into the counting array for counting.

Thank you for considering looking through my code. However, the output you wrote is still different from the Python's sort result. Why is it so? @DarrylG — PenguinSnow, Apr 01 '20 at 12:42
@PenguinSnow--thanks for the tip, I hadn't noticed. I'll fix ASAP. — DarrylG, Apr 01 '20 at 13:32
@PenguinSnow--updated my answer. Let me know if you have any questions or see any errors. — DarrylG, Apr 01 '20 at 16:16
Thank you for going to the length of building on my code. I have a question: what does the get_index function do? It doesn't seem to have code content other than a long piece of documentation. — PenguinSnow, Apr 01 '20 at 16:53
@PenguinSnow--oops, I fixed that. The actual code got dropped during my commenting. — DarrylG, Apr 01 '20 at 17:08
Thank you for the update. So, can you explain just the part where you made the radix sort work where previously it couldn't? — PenguinSnow, Apr 01 '20 at 17:13
@PenguinSnow--upated with more explanation. Let me know if this clarifies things. — DarrylG, Apr 01 '20 at 18:21
https://chat.stackoverflow.com/rooms/210749/ Sorry for not putting this up — PenguinSnow, Apr 02 '20 at 04:05

Radix Sort for Strings in Python

1 Answers1