
I have a collection called Dataframe that is essentially a std::vector<char>. It contains records of a fixed length record_size, and each record may be composed of one or more fields of different types. The definition of the Dataframe is not negotiable because it is tightly coupled with the rest of the codebase. In general, the structure can contain a huge number of records, on the order of millions, and a large number of columns, on the order of hundreds.

Based on the above, there can be a single column or multiple ones. The single-column case is what I call the trivial case, because the vector of chars can be cast to the actual type and sorting is extremely efficient.

The problem is the multi-column case: I need a vector of pointers to the records, a comparator function that jumps around in memory to access the real data, and pointer swaps; when the sorting procedure ends, I apply the permutation identified by the pointers to the real data. This is slower, especially when the Dataframe contains real-world data.

Below is an extract of code that illustrates the situation. Is there an alternative approach to solve the problem?

#include <iostream>
#include <vector>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <algorithm>
#include <numeric>
#include <iomanip>
#include <functional>

class Dataframe {

public:
    enum class Base : char
    {
        SIGNED = 'S',
        UNSIGNED = 'U',
        CHAR = 'A',
        // and other types like floats, date, timestamp, etc.
    };

    class Dtype
    {
    public:
        Dtype(Base base_dtype, std::size_t size) : m_base_dtype(base_dtype), m_size(size) {}
        auto base_dtype() const { return m_base_dtype; }
        auto size() const { return m_size; }

    private:
        Base m_base_dtype;
        std::size_t m_size;
    };

    class Comparator {
    public:
        using comparator_fn = std::function<std::int8_t(const char*, const char*)>;
    
        static comparator_fn make_comparator(Dataframe::Base base_dtype, std::size_t size, std::size_t offset) {
            switch (base_dtype) {
                case Dataframe::Base::CHAR:
                    return [offset](const char *a, const char *b) {
                        // normalize std::strcmp's result to -1/0/1 so it fits in int8_t
                        const int r = std::strcmp(a + offset, b + offset);
                        return static_cast<std::int8_t>((r > 0) - (r < 0));
                    };
                case Dataframe::Base::SIGNED:
                    return [offset](const char *a, const char *b) {
                        const auto v1 = *reinterpret_cast<const int*>(a + offset);
                        const auto v2 = *reinterpret_cast<const int*>(b + offset);
                        return (v1 < v2) ? -1 : (v1 > v2) ? 1 : 0;
                    };
                case Dataframe::Base::UNSIGNED:
                    return [offset](const char *a, const char *b) {
                        const auto v1 = *reinterpret_cast<const unsigned int*>(a + offset);
                        const auto v2 = *reinterpret_cast<const unsigned int*>(b + offset);
                        return (v1 < v2) ? -1 : (v1 > v2) ? 1 : 0;
                    };
                default:
                    throw std::runtime_error("Unknown base dtype");
            }
        }
    
    
        static comparator_fn make_composite_comparator(const std::vector<Dataframe::Dtype> &dtypes) {
            std::vector<comparator_fn> F;
            std::size_t offset = 0;
    
            for (auto &dtype : dtypes) {
                F.push_back(make_comparator(dtype.base_dtype(), dtype.size(), offset));
                offset += dtype.size();
            }
    
            return [F](const char *a, const char *b) {
                for (auto &f : F) {
                    if (const int8_t result = f(a, b); result != 0)
                        return result < 0;
                }
                return false;
            };
        }
    };

    Dataframe(const std::vector<char> &raw_data, const std::vector<Dtype> &dtypes) : m_raw_data(raw_data), m_dtypes(dtypes)
    {
        cols_count = m_dtypes.size();
        m_record_size = std::accumulate(m_dtypes.begin(), m_dtypes.end(), std::size_t{0}, [](std::size_t acc, const Dtype &dtype) {
            return acc + dtype.size();
        });
        m_records_count = m_raw_data.size() / m_record_size;

        std::cout << "Info:" << std::endl;
        std::cout << "  - cols count: " << cols_count << std::endl;
        std::cout << "  - record size: " << m_record_size << std::endl;
        std::cout << "  - records count: " << m_records_count << std::endl;
        std::cout << std::endl;
    }

    void print() {
        // cycle over the records
        std::cout << "Dataframe:" << std::endl;
        for (std::size_t i = 0; i < m_records_count; i++) {
            // cycle over the dtypes (e.g. columns)
            auto current_field = m_raw_data.data() + i * m_record_size;

            for (std::size_t j = 0; j < m_dtypes.size(); j++) {
                auto dtype = m_dtypes[j];
                auto base_dtype = dtype.base_dtype();
                auto size = dtype.size();
                current_field += j > 0 ? m_dtypes[j - 1].size() : 0;

                // print the data
                switch (base_dtype) {
                    case Base::CHAR:
                        std::cout << std::setw(1) << *current_field << ";";
                        break;
                    case Base::SIGNED:
                        std::cout << std::setw(4) << *reinterpret_cast<int*>(current_field) << ";";
                        break;
                    case Base::UNSIGNED:
                        std::cout << std::setw(4) << *reinterpret_cast<unsigned int*>(current_field) << ";";
                        break;
                }
            }
            std::cout << std::endl;
        }
        std::cout << std::endl;
    }

    void sort() {
        if (cols_count == 1) {
            auto dtype = m_dtypes[0];
            auto base_dtype = dtype.base_dtype();
            auto size = dtype.size();
            auto data = m_raw_data.data();

            // sort in place by casting the raw buffer to the column type
            switch (base_dtype) {
                case Base::CHAR:
                    std::sort(data, data + m_raw_data.size());
                    break;
                case Base::SIGNED:
                    std::sort((int*)data, (int*)data + m_raw_data.size() / size);
                    break;
                case Base::UNSIGNED:
                    std::sort((unsigned int*)data, (unsigned int*)data + m_raw_data.size() / size);
                    break;
            }
        } else {
            // This is a naive implementation, it works but it's not efficient
            // use the ptr, compare and then swap the records
            auto comparator = Comparator::make_composite_comparator(m_dtypes);

            // create a set of pointers to the records
            std::vector<char*> ptrs;
            for (std::size_t i = 0; i < m_records_count; i++) {
                ptrs.push_back(m_raw_data.data() + i * m_record_size);
            }

            // sort the pointers
            std::sort(ptrs.begin(), ptrs.end(), comparator);

            // apply pointers permutation
            std::vector<char> tmp(m_raw_data.size());
            for (std::size_t i = 0; i < m_records_count; i++) {
                std::copy(ptrs[i], ptrs[i] + m_record_size, tmp.data() + i * m_record_size);
            }
            m_raw_data = tmp;
        }
    }

private:
    std::vector<char> m_raw_data;
    std::vector<Dtype> m_dtypes;
    std::size_t m_records_count;
    std::size_t m_record_size;
    std::size_t cols_count;
};

void trivial_case_with_1_column() {
    std::vector<char> raw;

    // Example, trivial case with 1 column into the dataframe
    int data[] = {8, 30, 8, 19, 7, 6, 10, 5, 3, 2, 1, 5};

    raw.insert(raw.end(), (char*)data, (char*)data + sizeof(data));

    Dataframe df(raw, {Dataframe::Dtype(Dataframe::Base::SIGNED, sizeof(int))});
    df.print();
    df.sort();
    std::cout << "AFTER SORTING" << std::endl;
    df.print();
}

void problematic_case_with_2_or_more_columns() {
    std::vector<char> raw;

    int column1[] = {8, 30, 8, 19, 7, 6, 10, 5, 3, 2, 1, 5};
    char column2[] = {'z', 'a', 'a', 'b', 'c', 'd', 'e', 'x', 'g', 'h', 'i', 'a'};

    for (std::size_t i = 0; i < sizeof(column1) / sizeof(int); i++) {
        raw.insert(raw.end(), (char*)&column1[i], (char*)&column1[i] + sizeof(int));
        raw.insert(raw.end(), (char*)&column2[i], (char*)&column2[i] + sizeof(char));
    }

    const auto dtypes = {Dataframe::Dtype(Dataframe::Base::SIGNED, sizeof(int)), Dataframe::Dtype(Dataframe::Base::CHAR, sizeof(char))};
    Dataframe df(raw, dtypes);

    df.print();
    df.sort();
    std::cout << "AFTER SORTING" << std::endl;
    df.print();
}

int main(int argc, char const *argv[])
{
    std::cout << "This case can be sorted by using a std::sort that casts the pointer to the correct type and it's efficient" << std::endl;
    trivial_case_with_1_column();

    std::cout << "----------------------------------------------------------------\n" << std::endl;

    std::cout << "I don't know how to manage effectively this case" << std::endl;
    problematic_case_with_2_or_more_columns();

    return 0;
}

Online version: https://onlinegdb.com/gP5-8LOmv

  • There are several possible optimizations for sorting large datasets, but many of them rely on the nature of the data. Can you give some characterization of the data, e.g. are the values in each column mostly unique and randomly distributed, or might there be a lot of duplicates? And is speed the ultimate objective, or is reasonably fast good enough? That is, how much effort are you willing to spend on optimization? With some additional information, I might be able to offer some suggestions. – RandomBits Apr 21 '23 at 20:16
  • Also, if you were able to provide a sample dataset for testing performance, that would make it a lot easier to attack the optimization. – RandomBits Apr 21 '23 at 20:31
  • @RandomBits In general, I am aiming for full speed. Given that the input datasets tend to be real-world datasets, most of the time there could be a lot of duplicates, but in other cases there might be a lot of unique values. It's quite difficult to have something that looks randomly distributed. Regarding the datasets, a typical one could be the merge of the two columns of this one: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/JGVF9A/EATHF7 https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/JGVF9A/8FX9BV – reuseman Apr 22 '23 at 09:29
  • Moreover, here you can find the code that contains a helper function to read a csv https://onlinegdb.com/PxPQdiswJ – reuseman Apr 22 '23 at 09:29
  • I downloaded the dataset you linked and decompressed the file which is binary. The first uint64 (8 bytes) contains the decimal number 200M which I presume is the number of elements in the dataset. The remaining uint64's are mostly unique and look to be already sorted. Just wanted to make sure that is what I am supposed to see? – RandomBits Apr 22 '23 at 17:54
  • @RandomBits You are right, the first uint contains info about the size. Regarding the order, it's my mistake. The dataset is indeed ordered, but in this case a random permutation can be applied to get mock-up data of the right size. For a real one, even in terms of distribution, you can find a simple example that contains just 2 columns here: https://www.kaggle.com/datasets/reusemann/matches – reuseman Apr 22 '23 at 20:59
  • I ran some benchmarks sorting the 200M int64's (after shuffling them) using std::sort with three methods: 1) direct sort (~14s), 2) sorting pointers (~27s) and 3) sorting indices (~27s). As you noted, there is a high cost to sorting indirectly which is the only option if the record length is not known at compile time and you are using std::sort. All of these operations are probably limited by memory bandwidth and the indirect methods use ~2x the memory. I want to try a custom sorting function that takes the record length as a parameter and see if that can reduce this overhead. – RandomBits Apr 22 '23 at 21:44
  • I successfully used a radix-based sort that produces a sort index to indirectly sort the 200M int64's with almost the same performance as std::sort (14.2s versus 15.0s). This should allow any size record and composite keys without the penalty of sorting pointers with std::sort. I will try to clean up the code and post it tomorrow. – RandomBits Apr 23 '23 at 03:28
  • @RandomBits Feel free to publish a draft, even that would be useful – reuseman Apr 24 '23 at 08:02
  • Is there anything else you need to accept my answer? – RandomBits Apr 30 '23 at 21:02

2 Answers


This is a more detailed followup to my initial answer.

I built some tools to measure the performance of different sort algorithms using synthetically generated datasets. The code and build instructions are available in the GitHub repo core-cpp/sort.

Based on OP's problem description, I used the following three datasets, each with 10M rows, as benchmarks:

  1. 1 column of uint64_t's with a single sort key on the column.
  2. 32 columns of uint64_t's with a single sort key on the first column.
  3. 32 columns of uint64_t's with a sort key consisting of the first 8 columns.

While std::sort uses a highly optimized, almost branchless variant of quicksort, there is no obvious way to invoke it to directly sort records whose length is not known at compile time.

One option is to use indirection with either pointers or an index and not actually sort the data itself. For large datasets there is a large non-locality penalty, because key comparisons require reading effectively random rows from the dataset. This more than offsets any benefits.

Another option is to use iterators that also implement std::swap. This is insufficient for std::sort, though, because its insertion-sort step at the leaves copies an element into a temporary value, which for runtime-sized records would require dynamic memory allocation.

Algorithms

Based on this understanding, I implemented the following sort algorithms for comparison:

std::sort for case #1 only

As a special case, if a record is simply a single uint64_t, directly invoke std::sort passing pointers to the data and a lambda for the direct comparison of the uint64_t values. This represents the best possible performance. In the general case, we will not know the record length or key type at compile time so this method is simply a benchmark for comparison.

std::sort for case #1 only with generic compare

Similar to the first case above, but using the generic comparison function instead of assuming a compile-time-known key type. This represents a more realistic upper limit on expected performance, since in general the key type will not be known at compile time. The record length will not be known at compile time either, so this method is again simply a benchmark for comparison.

C Library qsort_r

The C library function qsort_r provides an interface that allows the specification of a record length and a comparison function. This makes it possible to sort the records directly in a straightforward manner. The function interface varies between macOS and Linux, so some conditional compilation is required.
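
Just to sketch the idea, on Linux/glibc a call could look roughly like the snippet below (the BSD/macOS variant of qsort_r takes its arguments in a different order, hence the conditional compilation). The key_offset field is only a stand-in for whatever key description the real comparator needs.

#include <cstddef>
#include <cstdlib>   // qsort_r (GNU extension on Linux/glibc)
#include <vector>

// Context handed to the comparison callback; a real version would carry
// the full composite key description instead of a single offset.
struct SortContext {
    std::size_t key_offset;
};

// glibc-style callback: raw record pointers plus the user-supplied context.
static int compare_records(const void *a, const void *b, void *arg) {
    const auto *ctx = static_cast<const SortContext *>(arg);
    const char *ra = static_cast<const char *>(a);
    const char *rb = static_cast<const char *>(b);
    const int va = *reinterpret_cast<const int *>(ra + ctx->key_offset);
    const int vb = *reinterpret_cast<const int *>(rb + ctx->key_offset);
    return (va > vb) - (va < vb);   // three-way result, as qsort_r expects
}

void sort_records_with_qsort_r(std::vector<char> &raw, std::size_t record_size, SortContext ctx) {
    // qsort_r swaps whole records of record_size bytes in place, so no
    // pointer/index indirection and no final permutation pass are needed.
    qsort_r(raw.data(), raw.size() / record_size, record_size, compare_records, &ctx);
}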

Custom quicksort

The basic quicksort algorithm is conceptually straightforward, and I implemented a simple version that allows for runtime record lengths and comparison functions. I still need to experiment with more advanced pivot selection.
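
The actual implementation is in the repo linked above; as a rough sketch of the idea, a Lomuto-style quicksort over records whose size is only known at run time could look like the following, where the comparator takes raw record pointers and std::swap_ranges swaps whole records byte by byte. The last-element pivot is exactly the naive choice mentioned above; median-of-three or a random pivot would be the obvious refinement.

#include <algorithm>
#include <cstddef>
#include <functional>

using record_less = std::function<bool(const char *, const char *)>;

// Byte-wise swap of two fixed-size records.
inline void swap_records(char *a, char *b, std::size_t record_size) {
    if (a != b)
        std::swap_ranges(a, a + record_size, b);
}

// In-place quicksort over the records with indices [lo, hi], using a
// Lomuto partition with the last record as the pivot.
void quicksort_records(char *data, std::size_t lo, std::size_t hi,
                       std::size_t record_size, const record_less &less) {
    if (lo >= hi)
        return;
    auto rec = [&](std::size_t i) { return data + i * record_size; };

    char *pivot = rec(hi);          // the pivot record stays in place during the loop
    std::size_t store = lo;
    for (std::size_t i = lo; i < hi; ++i) {
        if (less(rec(i), pivot)) {
            swap_records(rec(i), rec(store), record_size);
            ++store;
        }
    }
    swap_records(rec(store), rec(hi), record_size);

    if (store > 0)
        quicksort_records(data, lo, store - 1, record_size, less);
    quicksort_records(data, store + 1, hi, record_size, less);
}

// Convenience wrapper over the whole buffer.
void quicksort_frame(char *data, std::size_t n_records, std::size_t record_size,
                     const record_less &less) {
    if (n_records > 1)
        quicksort_records(data, 0, n_records - 1, record_size, less);
}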

std::sort with pointers

Construct a vector of pointers that references the rows of the dataset and invoke std::sort on the pointers with a custom comparison lambda.

std::sort with index

Construct an index vector that references the rows of the dataset and invoke std::sort on the index vector with a custom comparison lambda.
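
For reference, the index variant (and the final gather pass that materializes the sorted records) could be sketched as follows, assuming a raw byte buffer, a record size, and a record comparator along the lines of the ones in the question:

#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Indirect sort: order an index vector instead of the records themselves,
// then gather the records into a new buffer in one sequential pass.
std::vector<char> index_sort_records(const std::vector<char> &raw, std::size_t record_size,
                                     const std::function<bool(const char *, const char *)> &less) {
    const std::size_t n = raw.size() / record_size;

    std::vector<std::size_t> index(n);
    std::iota(index.begin(), index.end(), std::size_t{0});

    // Every comparison dereferences into the still-unsorted data, which is
    // the non-locality penalty discussed above.
    std::sort(index.begin(), index.end(), [&](std::size_t i, std::size_t j) {
        return less(raw.data() + i * record_size, raw.data() + j * record_size);
    });

    // Apply the permutation: copy each record to its final position.
    std::vector<char> sorted(raw.size());
    for (std::size_t i = 0; i < n; ++i)
        std::copy_n(raw.data() + index[i] * record_size, record_size,
                    sorted.data() + i * record_size);
    return sorted;
}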

Custom radix sort

The basic radix sort algorithm is conceptually straightforward, and I implemented a simple version that allows for runtime record lengths and comparison functions.

Performance

I measured the performance across three different architectures (all hardware has plentiful main memory relative to the dataset size):

  1. Apple M1 Max 128Gb
  2. Intel Xeon E5-2698 256Gb
  3. AMD 7601 32Gb

There was a surprising amount of variation in relative performance among the different algorithms across hardware platforms. In particular, one of the radix methods showed extreme promise on arm64 and Intel, but was the worst performer on AMD.

Here are some typical runs for benchmark case 1:

M1

$ make sort1 && sort1 -n 10000000 -r 8 u64:0
                   std::sort-direct:   637 ms
  std::sort-direct-compare-function:  1290 ms
                     qsort_r-direct:  1859 ms
                         quick-sort:  1560 ms
                   quick-block-sort:  1596 ms
                  std::sort-pointer:  1879 ms
                    std::sort-index:  1956 ms
                   radix-index-sort:   615 ms
               radix-mem-index-sort:   415 ms
               merge-bottom-up-sort:  2031 ms

Intel

$ make sort1 && sort1 -n 10000000 -r 8 u64:0
                   std::sort-direct:   966 ms
  std::sort-direct-compare-function:  2481 ms
                     qsort_r-direct:  3705 ms
                         quick-sort:  3122 ms
                   quick-block-sort:  3416 ms
                  std::sort-pointer:  3863 ms
                    std::sort-index:  4068 ms
                   radix-index-sort:  5151 ms
               radix-mem-index-sort:  1797 ms
               merge-bottom-up-sort:  3869 ms

AMD

$ make sort1 && sort1 -n 10000000 -r 8 u64:0
                   std::sort-direct:  1121 ms
  std::sort-direct-compare-function:  2461 ms
                     qsort_r-direct:  3718 ms
                         quick-sort:  3333 ms
                   quick-block-sort:  3752 ms
                  std::sort-pointer:  7712 ms
                    std::sort-index:  7141 ms
                   radix-index-sort: 13216 ms
               radix-mem-index-sort: 11665 ms
               merge-bottom-up-sort:  4497 ms

If we take std::sort-direct-compare-function as the baseline, the custom quicksort implementation and the qsort_r C library function are the most consistent performers. The following table shows the running times relative to the baseline.

algo \ cpu     M1     Intel   AMD
quicksort      1.21   1.26    1.35
qsort_r        1.44   1.49    1.51
radix-mem      0.32   0.72    4.74

As with any optimization strategy, the actual data and the specific hardware will be important factors.

RandomBits

This is not a complete answer, but it demonstrates four variations of sorting: pointers, index, radix, and vanilla. The code generates random bytes organized as a specified number of rows and columns. It then runs the different sort variations based on the specified sort keys, outputs basic timing information, and checks that the sort results are correct.

If the data rows are 8 bytes and there is a single sort key, it will also run a vanilla sort using std::sort, assuming a uint64_t data type.

There are still plenty of optimizations to think about, but here is some sample output running on an Apple M1.

$ sort -v -n 10000000 -r 8 u64:0
creating random dataset with 10000000 rows and 8 bytes-per-row
dataset created in 111ms
pointer sort: 1816ms
index sort: 1807ms
radix sort: 459ms
vanilla sort: 619ms

$ sort -v -n 10000000 -r 128 u64:32
creating random dataset with 10000000 rows and 128 bytes-per-row
dataset created in 1537ms
pointer sort: 2755ms
index sort: 2734ms
radix sort: 878ms

Edit: Latest Results

# arm64 - Apple M1 Ultra 128Gb
make sort1 && sort1 -v -n 10000000 -r 8 u64:0
creating random dataset with 10000000 rows and 8 bytes-per-row
dataset created: 427ms
pointer sort: 1883ms
index sort: 1860ms
radix sort: 473ms
radix-dup sort: 262ms
merge-bottom-up sort: 2082ms
quick sort: 1369ms
vanilla sort: 629ms

# x86 - Intel Xeon E5-2698 256Gb
make sort1 && sort1 -v -n 10000000 -r 8 u64:0
creating random dataset with 10000000 rows and 8 bytes-per-row
dataset created: 3942ms
pointer sort: 3798ms
index sort: 3954ms
radix sort: 4927ms
radix-dup sort: 1495ms
merge-bottom-up sort: 3773ms
quick sort: 2560ms
vanilla sort: 909ms

# x86 - AMD 7601 32Gb
make sort1 && sort1 -v -n 10000000 -r 8 u64:0
creating random dataset with 10000000 rows and 8 bytes-per-row
dataset created: 2523ms
pointer sort: 6363ms
index sort: 7089ms
radix sort: 12431ms
radix-dup sort: 10055ms
merge-bottom-up sort: 4261ms
quick sort: 3918ms
vanilla sort: 1104ms

Edit: Here is a GitHub repo with the code and build instructions.

Edit: Here are some results for x86, and the performance characteristics are much different from arm64. There is more to investigate, because there is a large performance penalty on x86 for all but the vanilla sort.

$ sort1 -v -n 10000000 -r 8 u64:0
creating random dataset with 10000000 rows and 8 bytes-per-row
dataset created in 444ms
pointer sort: 11283ms
index sort: 11420ms
radix sort: 15717ms
vanilla sort: 1110ms

Here is the code. Unfortunately, it is specific to my development environment at the moment. I plan to put it on GitHub with the relevant dependencies to allow it to be built easily.

#include <random>
#include "core/util/tool.h"
#include "core/chrono/stopwatch.h"
#include "core/util/random.h"
#include "core/string/split.h"

enum class DataType { Signed64, Unsigned8, Unsigned16, Unsigned32, Unsigned64 };
using SortIndex = std::vector<uint>;

namespace core::str::detail {
template<>
struct lexical_cast_impl<DataType> {
    static DataType parse(std::string_view s) {
    using enum DataType;
    if (s == "i64") return Signed64;
    else if (s == "u8") return Unsigned8;
    else if (s == "u16") return Unsigned16;
    else if (s == "u32") return Unsigned32;
    else if (s == "u64") return Unsigned64;
    throw lexical_cast_error(s, "DataType");
    }
};
}; // core::str::detail

using ElementType = uint8_t;

class Frame {
public:
    using element_type = ElementType;

    struct iterator {
    using iterator_category = std::random_access_iterator_tag;
    using difference_type = std::ptrdiff_t;
    using value_type = element_type*;
    using pointer = value_type*;
    using reference = value_type&;

    iterator(value_type ptr, size_t row_size)
        : ptr_(ptr)
        , row_size_(row_size) {
    }

    auto operator++() {
        ptr_ += row_size_;
        return *this;
    }

    auto operator++(int) {
        iterator tmp = *this;
        ++(*this);
        return tmp;
    }

    reference operator*() {
        return ptr_;
    }

    pointer operator->() {
        return &ptr_;
    }

    friend bool operator==(iterator a, iterator b) {
        return a.ptr_ == b.ptr_;
    }

    private:
    value_type ptr_;
    size_t row_size_;
    };

    Frame(element_type *data, size_t bytes_per_row, size_t number_rows)
    : data_(data)
    , row_size_(bytes_per_row)
    , size_(number_rows) {
    }

    auto data() const {
    return data_;
    }

    auto row(size_t idx) const {
    return data_ + idx * row_size_;
    }

    auto operator[](size_t idx) const {
    return data_[idx];
    }

    auto operator[](size_t idx, size_t jdx) const {
    return *(data_ + idx * row_size_ + jdx);
    }

    iterator begin() const {
    return iterator{data_, row_size_};
    }

    iterator end() const {
    return iterator{data_ + row_size_ * size_, row_size_};
    }

    size_t row_size() const {
    return row_size_;
    }

    size_t size() const {
    return size_;
    }

private:
    element_type *data_;
    size_t row_size_, size_;
};

struct Key {
    DataType type;
    size_t offset;

    size_t length() const {
    switch (type) {
    case DataType::Signed64:
        return 8;
    case DataType::Unsigned8:
        return 1;
    case DataType::Unsigned16:
        return 2;
    case DataType::Unsigned32:
        return 4;
    case DataType::Unsigned64:
        return 8;
    }
    }
};

using Keys = std::vector<Key>;

namespace core::str::detail {
template<>
struct lexical_cast_impl<Key> {
    static Key parse(std::string_view s) {
    auto fields = split(s, ":");
    if (fields.size() != 2)
        throw lexical_cast_error(s, "Key");
    auto data_type = lexical_cast<DataType>(fields[0]);
    auto offset = lexical_cast<uint64_t>(fields[1]);
    return Key{data_type, offset};
    }
};
}; // core::str

bool compare(ElementType *a_ptr, ElementType *b_ptr, const Keys& sort_keys) {
    for (const auto& key : sort_keys) {
    switch (key.type) {
        using enum DataType;
    case Unsigned8:
        {
        auto a = *reinterpret_cast<uint8_t*>(a_ptr + key.offset);
        auto b = *reinterpret_cast<uint8_t*>(b_ptr + key.offset);
        if (a < b) return true;
        else if (a > b) return false;
        }
        break;

    case Unsigned16:
        {
        auto a = *reinterpret_cast<uint16_t*>(a_ptr + key.offset);
        auto b = *reinterpret_cast<uint16_t*>(b_ptr + key.offset);
        if (a < b) return true;
        else if (a > b) return false;
        }
        break;

    case Unsigned32:
        {
        auto a = *reinterpret_cast<uint32_t*>(a_ptr + key.offset);
        auto b = *reinterpret_cast<uint32_t*>(b_ptr + key.offset);
        if (a < b) return true;
        else if (a > b) return false;
        }
        break;

    case Unsigned64:
        {
        auto a = *reinterpret_cast<uint64_t*>(a_ptr + key.offset);
        auto b = *reinterpret_cast<uint64_t*>(b_ptr + key.offset);
        if (a < b) return true;
        else if (a > b) return false;
        }
        break;

    case Signed64:
        {
        auto a = *reinterpret_cast<int64_t*>(a_ptr + key.offset);
        auto b = *reinterpret_cast<int64_t*>(b_ptr + key.offset);
        if (a < b) return true;
        else if (a > b) return false;
        }
        break;
    }
    }
    return false;
}

auto pointer_sort(const Frame& frame, const Keys& sort_keys) {
    std::vector<ElementType*> ptrs;
    for (auto ptr : frame)
    ptrs.push_back(ptr);

    std::sort(ptrs.begin(), ptrs.end(), [&](ElementType *a_ptr, ElementType *b_ptr) {
    return compare(a_ptr, b_ptr, sort_keys);
    });

    return ptrs;
}

auto index_sort(const Frame& frame, const Keys& sort_keys) {
    SortIndex index(frame.size());
    for (auto i = 0; i < frame.size(); ++i)
    index[i] = i;

    auto ptr = frame.data();
    std::sort(index.begin(), index.end(), [&](int idx, int jdx) {
    return compare(ptr + idx * frame.row_size(), ptr + jdx * frame.row_size(), sort_keys);
    });

    return index;
}

size_t total_key_length(const Keys& keys) {
    size_t n{};
    for (const auto& key : keys)
    n += key.length();
    return n;
}

using RadixIndex = int;
constexpr auto RadixSize = 257;

auto radix_sort(const Frame& frame, const Keys& sort_keys) {
    Keys keys = sort_keys;
    std::reverse(keys.begin(), keys.end());

    const auto key_length = total_key_length(keys);
    RadixIndex buckets[key_length][RadixSize];
    memset(buckets, 0, key_length * RadixSize * sizeof(RadixIndex));

    for (auto row : frame) {
    auto bdx = 0;
    for (const auto& key : keys) {
        auto field = row + key.offset;
        switch (key.type) {
        using enum DataType;
        case Unsigned8:
        ++buckets[bdx++][1 + field[0]];
        break;
        case Unsigned16:
        for (auto i = 0; i < 2; ++i)
            ++buckets[bdx++][1 + field[i]];
        break;
        case Unsigned32:
        for (auto i = 0; i < 4; ++i)
            ++buckets[bdx++][1 + field[i]];
        break;
        case Unsigned64:
        for (auto i = 0; i < 8; ++i)
            ++buckets[bdx++][1 + field[i]];
        break;
        case Signed64:
        for (auto i = 0; i < 8; ++i)
            ++buckets[bdx++][1 + field[i]];
        break;
        }
    }
    }

    for (auto i = 0; i < key_length; ++i) {
    auto *counts = buckets[i];
    for (auto j = 1; j < 257; ++j)
        counts[j] += counts[j - 1];
    }

    SortIndex index(frame.size()), new_index(frame.size());
    for (auto i = 0; i < frame.size(); ++i)
    index[i] = i;

    auto bdx = 0;
    for (const auto& key : keys) {
    switch (key.type) {
        using enum DataType;
    case Unsigned8:
    case Unsigned16:
    case Unsigned32:
    case Unsigned64:
        for (auto i = 0; i < key.length(); ++i) {
        auto *counts = buckets[bdx++];
        auto offset = key.offset + i;
        for (auto j = 0; j < frame.size(); ++j) {
            auto value = frame[index[j], offset];
            auto& loc = counts[value];
            new_index[loc] = index[j];
            ++loc;
        }
        std::swap(index, new_index);
        }
        break;
    case Signed64:
        for (auto i = 0; i < key.length(); ++i) {
        auto *counts = buckets[bdx++];
        auto offset = key.offset + i;
        for (auto j = 0; j < frame.size(); ++j) {
            auto value = frame[index[j], offset];
            auto& loc = counts[value];
            new_index[loc] = index[j];
            ++loc;
        }
        std::swap(index, new_index);
        }
        break;
    }
    }

    return index;
}

bool check_sort(const std::vector<ElementType*>& ptrs, const Keys& sort_keys) {
    for (auto i = 1; i < ptrs.size(); ++i)
    if (not compare(ptrs[i-1], ptrs[i], sort_keys))
        return false;
    return true;
}

bool check_sort(const Frame& frame, const SortIndex& index, const Keys& sort_keys) {
    auto ptr = *frame.begin();
    for (auto i = 1; i < index.size(); ++i)
    if (not compare(ptr + index[i-1] * frame.row_size(),
            ptr + index[i] * frame.row_size(),
            sort_keys))
        return false;
    return true;
}

bool check_sort(const Frame& frame, const Keys& sort_keys) {
    auto ptr = reinterpret_cast<uint64_t*>(frame.data());
    for (auto i = 1; i < frame.size(); ++i)
    if (not (ptr[i-1] < ptr[i]))
        return false;
    return true;
}

int tool_main(int argc, const char *argv[]) {
    ArgParse opts
    (
     argValue<'n'>("number-rows", (uint64_t)100, "Number of rows"),
     argValue<'r'>("bytes-per-row", (uint64_t)64, "Bytes per row"),
     argValues<'*', std::vector, Key>("keys", "Sort keys"),
     argFlag<'v'>("verbose", "Verbose diagnostics")
     );
    opts.parse(argc, argv);
    auto number_rows = opts.get<'n'>();
    auto bytes_per_row = opts.get<'r'>();
    auto sort_keys = opts.get<'*'>();
    auto verbose = opts.get<'v'>();

    if (bytes_per_row bitand 0x7)
    throw core::runtime_error("bytes-per-row must be a multple of 8: {}", bytes_per_row);

    if (sort_keys.size() == 0)
    throw core::runtime_error("At least one sort key must be specified");

    if (verbose)
    cout << fmt::format("creating random dataset with {} rows and {} bytes-per-row",
                number_rows, bytes_per_row)
         << endl;

    chron::StopWatch timer;
    timer.mark();

    auto number_elements = number_rows * bytes_per_row / sizeof(uint64_t);
    std::vector<uint64_t> raw_data(number_elements);
    std::uniform_int_distribution<uint64_t> d{};
    for (auto i = 0; i < number_elements; ++i)
    raw_data[i] = d(core::rng());

    if (verbose) {
    auto millis = timer.elapsed_duration<std::chrono::milliseconds>().count();
    cout << fmt::format("dataset created in {}ms", millis) << endl;
    }

    ElementType *data = reinterpret_cast<ElementType*>(raw_data.data());
    Frame frame{data, bytes_per_row, number_rows};

    timer.mark();
    auto ptrs = pointer_sort(frame, sort_keys);
    if (verbose) {
    auto millis = timer.elapsed_duration<std::chrono::milliseconds>().count();
    cout << fmt::format("pointer sort: {}ms", millis) << endl;
    }
    if (not check_sort(ptrs, sort_keys))
    throw core::runtime_error("pointer sort failed");

    timer.mark();
    auto index = index_sort(frame, sort_keys);
    if (verbose) {
    auto millis = timer.elapsed_duration<std::chrono::milliseconds>().count();
    cout << fmt::format("index sort: {}ms", millis) << endl;
    }
    if (not check_sort(frame, index, sort_keys))
    throw core::runtime_error("index sort failed");

    timer.mark();
    auto radix_index = radix_sort(frame, sort_keys);
    if (verbose) {
    auto millis = timer.elapsed_duration<std::chrono::milliseconds>().count();
    cout << fmt::format("radix sort: {}ms", millis) << endl;
    }

    if (not check_sort(frame, radix_index, sort_keys))
    throw core::runtime_error("radix sort failed");

    if (frame.row_size() == 8 and sort_keys.size() == 1) {
    timer.mark();
    auto ptr = reinterpret_cast<uint64_t*>(frame.data());
    std::sort(ptr, ptr + frame.size(), [](const uint64_t& a, const uint64_t& b) {
        return a < b;
    });
    if (verbose) {
        auto millis = timer.elapsed_duration<std::chrono::milliseconds>().count();
        cout << fmt::format("vanilla sort: {}ms", millis) << endl;
    }
    if (not check_sort(frame, sort_keys))
        throw core::runtime_error("vanilla sort failed");
    }

    return 0;
}
RandomBits
  • The huge problem is the performance penalty on the x86 architecture, because that's where I currently need to solve the problem. It's amazing that it works flawlessly on ARM, though. What do you think could be the issue? Regarding the optimizations, which ones do you think could provide the best return on investment? – reuseman Apr 25 '23 at 13:13
  • Yes, I was pretty shocked at the difference. After running some additional experiments and giving it some thought, I don't have a good explanation yet. I am currently profiling the x86 version because I really want to understand what is going on. – RandomBits Apr 25 '23 at 15:58
  • I added some new results on one arm64 and two x86 machines for various sorting algorithms. The two contenders look like `quick sort` and `radix-dup sort`, but it will depend a lot on your particular hardware. The `quick sort` method is very simple and might be the easiest for you to try. – RandomBits Apr 26 '23 at 02:56