
I have a CSV parser that reads in 15+ million rows (with many duplicates). Once parsed into structs, the rows need to be added to a collection. Each struct has the properties Key (int), A (datetime), and B (int) (and others that aren't relevant here).

Requirement A: The collection needs to enforce uniqueness by a Key.

Requirement B: In a later step, I need the collection sorted by property A (the datetime), then by B (the int).

Constraint: The structs eventually need to be traversed in order, one by one, with references to neighbors (a LinkedList presents the cleanest solution here); the point of this operation is to partition the set. Please assume that this is the earliest the partitioning can occur (i.e., it cannot be done at the parsing stage).

I've found that SortedSet<T> works quite well for Requirement A and is quite performant, even though its O(log n) insertions are much slower than HashSet<T>'s O(1) - though I don't care about sorting on the key. HashSet<T> gets bogged down when the collection gets huge, which apparently is a known issue, while SortedSet<T> does not suffer this drawback.

The problem: When I get to the step for Requirement B, sorting the collection (a SortedSet<T> passed to a method as IEnumerable<T>) takes a prohibitive amount of time (20+ minutes of grinding, all in-memory, no page file usage).

The question: Which collection(s) would best address this problem? One idea is to use two collections: one to enforce uniqueness (like a HashSet<int> or SortedSet<int> of keys), and a second SortedSet<T> to handle sorting at the parsing stage (i.e., as far upstream as possible). But the application is already memory-intensive, and the performance penalty of hitting the pagefile is prohibitive.
What options does that leave me with for a single collection that enforces uniqueness by one characteristic but sorts by other, unrelated characteristics? SortedSet<T> takes an IComparer<T> (but not both an IComparer<T> and an IEquatable<T>), so if it relies on CompareTo to enforce uniqueness, it doesn't seem to fit my requirements. Is subclassing SortedSet<T> the way to go?
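
To illustrate the conflict as I understand it (sketch only, not a solution): SortedSet<T> derives uniqueness from the same IComparer<T> that defines its ordering, so one comparer can't do both jobs:

// Unique by Key - but then it's also *ordered* by Key, which isn't what I need:
var byKey = new SortedSet<Dto>(Comparer<Dto>.Create((x, y) => x.Key.CompareTo(y.Key)));

// Ordered by (Timestamp, SomeInt) - but now "duplicates" are judged by that pair,
// so two Dtos with the same Key but different Timestamps would both be admitted:
var bySort = new SortedSet<Dto>(Comparer<Dto>.Create((x, y) =>
{
    int c = x.Timestamp.CompareTo(y.Timestamp);
    return c != 0 ? c : x.SomeInt.CompareTo(y.SomeInt);
}));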

Edit: The sort code:

SortedSet<Dto> parsedSet = {stuff};
var sortedLinkedStructs = new LinkedList<Dto>(parsedSet.OrderBy(t => t.Timestamp).ThenBy(i => i.SomeInt));

The struct:

public readonly struct Dto : IEquatable<Dto>, IComparer<Dto>, IComparable<Dto>
{
    public readonly DateTime Timestamp;
    public readonly int SomeInt;
    public readonly int Key;

    public Dto(DateTime timestamp, int someInt, int key)
    {
        Timestamp = timestamp;
        SomeInt = someInt;
        Key = key;
    }

    public bool Equals(Dto other) => this.Key == other.Key;
    public override int GetHashCode() => this.Key.GetHashCode();
    public int Compare(Dto x, Dto y) => x.Key.CompareTo(y.Key);
    public int CompareTo(Dto other) => this.Key.CompareTo(other.Key);
}
Kevin Fichter
  • I think there might be something *very* wrong with your sort code; what you're doing sounds kinda like some things I do, on a system that has 15M rows (it is the "tag engine" that drives a number of features here on Stack Overflow - so 15,223,534 questions at time of writing, plus however many were active and deleted, since I don't reclaim rows); rebuilding the pre-sorted indexes ("newest", "votes", "active", etc) takes *milliseconds*. – Marc Gravell Jan 19 '18 at 16:51
  • Thank you, @MarcGravell. I've added code for my struct and the sort logic. – Kevin Fichter Jan 19 '18 at 17:01
  • sounds like a job for a database. sqlite is very easy to use from c# – pm100 Jan 19 '18 at 17:02
  • @pm100 this program parses csv files into a MS Sql database, via a DataTable and BulkSqlCopy. Is your suggestion something along the lines of loading all of the rows into a staging table, then process those rows into prod? Interesting idea! I had not considered that... – Kevin Fichter Jan 19 '18 at 17:06
  • You inspired me to write more on this topic, and drop some reference code that you might find useful: http://blog.marcgravell.com/2018/01/a-sort-of-problem.html – Marc Gravell Jan 20 '18 at 00:51
  • I added another implementation: a fully block-parallel radix sort using only verifiable code (`Span`); works really very nicely – Marc Gravell Jan 22 '18 at 22:18

1 Answer


This might not be a direct answer, but it is an approach I've used successfully for a similar system of similar scale. This is for the "tag engine" that drives the question lists here on Stack Overflow; essentially, I have a:

struct Question {
    // basic members - score, dates, id, etc - no text
}

and basically an oversized Question[] (actually I use a Question* in unmanaged memory, but that's because I need to be able to share it with some GPU code, for unrelated reasons). Populating the data is just filling successive rows in the Question[]. This data is never sorted - it is left alone as the source data - with just append (new key) or overwrite (same key); at worst we might need to reallocate and block-copy the data to a new array if we reach max capacity.
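
As a rough sketch of that population step (I haven't shown how "same key" rows are located; a Dictionary from id to array index is one plausible way):

Question[] data = new Question[1024];
var indexOfId = new Dictionary<int, int>(); // question id -> slot in data
int count = 0;

void AddOrUpdate(Question q)
{
    if (indexOfId.TryGetValue(q.Id, out int i))
    {
        data[i] = q; // same key: overwrite in place
    }
    else
    {
        if (count == data.Length)
            Array.Resize(ref data, count * 2); // reallocate + block-copy at capacity
        data[count] = q;
        indexOfId[q.Id] = count++; // new key: append
    }
}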

Now, instead of sorting that data, I separately keep an int[] (actually int* for the same reason as before, but... meh), where each value in the int[] is the index of the actual data in the Question[]. So initially it may be 0, 1, 2, 3, 4, 5, ... (although I pre-filter this, so it only contains the rows I want to keep - removing "deleted" etc).
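
Building that pre-filtered index vector is cheap; a sketch (IsDeleted is a hypothetical field here):

int[] index = new int[count];
int n = 0;
for (int i = 0; i < count; i++)
{
    if (!data[i].IsDeleted) // keep only the rows we want
        index[n++] = i;
}
Array.Resize(ref index, n); // trim to the filtered size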

The sort then rearranges that int[] index vector (never the Question[] itself), using either a modified parallel quicksort (see http://stackoverflow.com/questions/1897458/parallel-sort-algorithm) or a modified "introspective sort" (like here) - so at the end of the sort, I might have 0, 3, 1, 5, ....

Now, to iterate through the data, I just iterate through the int[] and use each value as a lookup into the actual data in the Question[]. This minimizes the amount of data movement during a sort, and allows me to keep multiple separate sorts (perhaps with different pre-filters) very efficiently. Sorting the 15M rows takes only milliseconds (this happens every minute or so, to bring new questions into Stack Overflow, or to note changes to existing questions).
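
So a traversal is just an indirect walk over the index vector (sketch; note this also gives the question's "neighbors" cheaply, as index[n - 1] and index[n + 1]):

static void Traverse(Question[] data, int[] index)
{
    for (int n = 0; n < index.Length; n++)
    {
        ref readonly Question q = ref data[index[n]]; // no copy of the big struct
        // ... consume q in sorted order ...
    }
}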

To make the sort as fast as possible, I try to write my sort code such that a composite sort can be represented by a single integer value, allowing a very efficient sort (usable by the introspective sort). For example, here's the code for the "last activity date, then question id" sort:

public override bool SupportsNaturallySortableUInt64 => true;
public override unsafe ulong GetNaturallySortableUInt64(Question* question)
{
    // compose the date (MSB) and ID (LSB)
    var val = Promote(question->LastActivityDate) << 32
        | Promote(question->Id);
    return ~val; // the same as ulong.MaxValue - val (which reverses order) but much cheaper
}

This works by treating the LastActivityDate as a 32-bit integer, left shifting by 32 bits and composing it with the Id as a 32-bit integer, meaning we can compare the date and the id in a single operation.
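
Promote isn't shown above; a plausible sketch is simply "widen without sign-extension", plus a date-to-integer conversion (epoch seconds is an assumption here):

static readonly DateTime Epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

static ulong Promote(int value) => (uint)value; // zero-extend, don't sign-extend
static ulong Promote(DateTime value)
    => (uint)(value - Epoch).TotalSeconds; // seconds since epoch fits in 32 bits

Note that fields which can be negative (scores, say) need a bias before packing into a narrow lane - e.g. (ushort)(score + 32768) for a 16-bit lane - so that ordering survives the cast.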

Or for "score, then answer score, then id":

public override unsafe ulong GetNaturallySortableUInt64(Question* question)
{
    // compose the data
    var val = Promote(question->Score) << 48
        | Promote(question->AnswerScore) << 32
        | Promote(question->Id);
    return ~val; // the same as ulong.MaxValue - val (which reverses order) but much cheaper
}

Note that GetNaturallySortableUInt64 is only called once per element - the results go into a working area, a ulong[] (yes, actually a ulong*) of the same size - so initially the two workspaces are something like:

int[]    ulong[]
0        34243478238974
1        12319388173
2        2349245938453
...      ...

Now I can do the entire sort by looking just at an int[] and a ulong[], such that the ulong[] vector ends up in the sorted order, and the int[] contains the indices of the items to look at.
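
With plain managed arrays (no pointers), the framework already has this operation built in: the two-array overload of Array.Sort sorts the keys and permutes the second array in lock-step. A sketch adapted to the question's Dto (the timestamp packing reuses the epoch idea from above, and is an assumption):

static int[] BuildSortedIndex(Dto[] data)
{
    var keys = new ulong[data.Length];
    var index = new int[data.Length];
    var epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
    for (int i = 0; i < data.Length; i++)
    {
        index[i] = i;
        // timestamp in the high 32 bits, SomeInt in the low 32 bits
        keys[i] = ((ulong)(uint)(data[i].Timestamp - epoch).TotalSeconds << 32)
                | (uint)data[i].SomeInt;
    }
    Array.Sort(keys, index); // keys sorted; index permuted alongside
    return index;
}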

Marc Gravell
  • Thank you for this great answer! Turning my timestamp into an epoch-based int then composing it with the SomeInt should help greatly, and this method might let me turn that new long into a PK, killing two birds with one stone. I'll take a look at the sort suggestions you mentioned. Thanks again! – Kevin Fichter Jan 19 '18 at 17:20
  • @K_foxer if you're just using arrays, then the 2-array version of Array.Sort does exactly this. I had to hack it because of the pointers – Marc Gravell Jan 19 '18 at 18:45
  • I like the optimizations here, but I'm curious about the final numbers. Your blog post's most optimized version says "2.5 seconds", but your answer and comments say that it runs in "milliseconds". I'd expect a few seconds execution time and obviously 2.5 seconds can be expressed as milliseconds, but I'm checking that there isn't an additional optimization I missed in your answer. – Tim M. Jan 20 '18 at 01:20
  • @Tim a: I did all that on my laptop, not my main PC, and b: I suspect I'm confusing initial sort Vs incremental sort. Either way: run your own timings on your own data. – Marc Gravell Jan 20 '18 at 01:26
  • @TimMedora btw, I doubled the perf again if you look at the code dump - see the radix sort; on my laptop, down to 1200ms – Marc Gravell Jan 20 '18 at 19:46
  • `DualArrayIndexedRadixSortParallel( r: 8 )` was the best performer on my machine, at ~690ms (release/any cpu/optimizations enabled). Very respectable for a data set that size. – Tim M. Jan 20 '18 at 22:15
  • @Tim that's a huge improvement from 20 minutes :) I've also invited folks to beat it, in a second blog post – Marc Gravell Jan 21 '18 at 09:21