Most time-efficient sorting algorithm for array of struct with conditions

Question

I know quicksort is the best sorting algorithm and this would be my first thought as well. But if I have the following situation:

typedef struct{
   char name[35];
   int age;
}person;

If I have an array of type person with N elements and I want to sort people aged between 30-40 by name. Would quicksort still be the best here, since we don't sort all the people in the array, just the ones aged between 30 to 40. If so how would I pick the pivot (since picking the middle element wouldn't guarantee that the person is in the right age range)? I've also thought that maybe merge sort would be a good candidate, since I could just split the array into 2 different arrays and put the conditions.

You need to define what sorting means in this situation. How does the sorted part relate to the ones you don't sort? If there is not a total ordering, the problem becomes something very different. Unless you create a new array and copy the ones you sort over. — user1984, Jan 16 '22 at 20:09
The ones that don't require sorting should stick to their positions, so only there should be an alphabetical order only among the persons aged 30 to 40 — Andrei0408, Jan 16 '22 at 20:11
In that case, I would sweep the array and copy over the persons in the range, sort them, then sweep the array again and copy them over. This approach would be of `O(n + m * log m)` time complexity where `n` is the length of the original array and `m` is the number of persons in the specified range. It has a `O(m)` space complexity. Depending on the data, this would be faster than creating a custom sort on the whole array. — user1984, Jan 16 '22 at 20:26
@Andrei0408, you need to think about this different: Once you have sorted your person elements on age, it allows you to quickly find the index of 30 and 40 (lower-higher). After this you can loop over them. Now 30-40 is only an example on the group you are interested in. — Aldert, Jan 16 '22 at 20:34

score 1 · Accepted Answer · answered Jan 16 '22 at 22:20

The first thing to do is to find the items with age in the range 30..40 and copy them in an optimized growing data structure. This data structure is an array with a capacity growing exponentially every time there is not enough remaining space (ie. std::vector in C++ or for example this in C).

Each item in this new data structure is defined as:

typedef struct {
   uint32_t name_prefix;
   uint32_t id; /* Assume the number of persons is not bigger than 4 billion */
} person_ref;

where name_prefix contains the first characters of person::name and where id contains the position of the item in the initial unfiltered array. Assuming person::name contains ASCII/ANSI characters, it can be for example (p.name[0] << 24) | (p.name[1] << 16) | (p.name[2] << 8) | p.name[3]. While it may sound expensive, modern mainstream processors can do that in only 1 fast instruction. If there is some additional restriction about the name (eg. upper case printable ASCII characters), you can pack more character in this prefix.

Then you can sort the new data structure using a simple quick-sort (or better: an Introsort). Note that qsort can be used in C and std::sort in C++. The comparison operator can be:

/* For a C++ code, please use references instead of pointers */
bool compare(const person& p1, const person* p2) {
    const uint32_t prefix1 = p1.name_prefix;
    const uint32_t prefix2 = p2.name_prefix;
    return prefix1 < prefix2 || 
           (prefix1 == prefix2 && strcmp(array[p1.id].name, array[p2.id].name) < 0);
}

where array is the original array with all the unfiltered items.

Finally, you can copy the item back thanks to the person_ref::id field.

This approach works well since the prefixes of the names are assumed often not to be equal. Indeed, comparing integers is much faster than comparing two strings with strcmp. Moreover, working on small items makes sorting algorithm faster because the copy is less expensive and also because the full array can better fit in fast CPU caches. If a lot prefixes are equal and the input data structure is big, then it should be better to use a 64-bit integer prefix, or to copy the filtered items in another data structure in the worst case (so to better use CPU caches).

If the number of filtered items is huge and you want an even faster sort, you can use a linear-time radix-sort combined with an intro-sort so to sort the person sharing the same prefix.

Most time-efficient sorting algorithm for array of struct with conditions

1 Answers1