17

I'm building a generic interface similar to generators that stream data from a stream to another, to eventually do things like:

file |> toCsv |> filter |> sort |> filter...

I know how to sort a vector/slice, but how can I sort from an incoming stream/iterator without putting it all in a vector?

stream.iter().collect_sorted()

I need to fuse vectors, trees, files, databases, etc., so sometimes I don't know how large the incoming data is without consuming it all.

I'm not against storing the results. The problem is that sorting is tied to slices/vector. I need to be able to do:

datasource |> Algo.sort |> next...

instead of:

let data = datasource |> into_vec
data.sort()
data |> next...

Different sorting algorithms exist for different use cases, so eventually I wish to apply the best for the data at hand:

datasource |> Algo.MergeSort |> next...
datasource |> Algo.BubbleSort |> next...
Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
mamcx
  • 15,916
  • 26
  • 101
  • 189
  • 4
    By definition, a sort needs to work on all the data at once surely? You can't just stream it through in O(n) time. –  Feb 15 '19 at 01:40
  • "Now, I know how sort a vector/slice, but how sort form a incoming stream/iterator without put all in a vector? alike:" magic ? – Stargateur Feb 15 '19 at 01:44
  • 1
    Not magic. This is a common problem with databases, where is not wise to load all the data then sort. The thing is I not see apply a sort algorithm like merge sort without put first on a vector. – mamcx Feb 15 '19 at 01:54
  • @mamcx I know many way of sort thing, in place without use more memory that needed is possible, but at some point you need to stock the data somewhere. There is no algorithm that can do what you want, even it's a common problem I doubt there is a common solution, except add more memory to your server. – Stargateur Feb 15 '19 at 01:58
  • I have not problem with storing the results. I know I need that. What I don't know is how in rust "plug-play" the sorting to my storage. – mamcx Feb 15 '19 at 02:00
  • 2
    Reminds me of [Sorting 1 million 8-digit numbers in 1 MB of RAM](https://stackoverflow.com/q/12748246/3650362). You can perhaps apply some clever (domain-specific) tricks, but it sounds like that's not what you want – trent Feb 15 '19 at 03:22

1 Answers1

37

It is literally impossible to sort a set of values without having all of the data. For example, if the iterator had 1 billion instances of 1 followed by a single 0, you simply wouldn't know that the zero needed to go first until you got there. You may wish to re-acquaint yourself with the concept of on- and offline algorithms.

without putting it all in a vector

That's easy: don't use a vector, use any type that implements FromIterator. For example, you could collect into a BinaryHeap:

use std::{collections::BinaryHeap, iter};

fn main() {
    let a_lot_of_numbers = iter::repeat(1).take(100).chain(iter::once(0));
    let data: BinaryHeap<_> = a_lot_of_numbers.collect();
}

Whether this is a good idea or not depends entirely on your case.

If you just don't want to see the vector or just wish to preserve the chaining, then I'd suggest using Itertools::sorted. This uses a Vec internally, meaning that all of the data is stored in memory before the first value is returned:

use itertools::Itertools; // 0.8.0
use std::iter;

fn main() {
    let a_lot_of_numbers = iter::repeat(1).take(100).chain(iter::once(0));

    for v in a_lot_of_numbers.sorted() {
        println!("{}", v);
    }
}

This is a common problem with databases, where is not wise to load all the data then sort

Databases are amazingly complex pieces of software with years of effort put into them considering carefully weighed tradeoffs. You won't find that grade of algorithm lying about in a package manager. Even if you could, databases don't always get it right, requiring skilled programmers to tweak queries to perform better. All you need to know about sorting in Postgres covers a good set of what Postgres can do.

It should be theoretically possible to write an iterator adapter that writes all of the data to disk, performs sorting there, then re-reads the data from the disk. This is called external sorting.

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
  • I think the key to OP's problem will be `FromIterator` and `IntoIterator` - then a module in the chain can simply accept iterators and output iterators. Hard without seeing how OP has implemented things so far though. –  Feb 15 '19 at 03:20
  • @swalladge Indeed. If OP is just using method chaining, then something closer to `Itertools::sorted` will be relevant. – Shepmaster Feb 15 '19 at 03:22
  • @ Shepmaster Looking how itertools do it is enough for me. I need to adapt to a streaming interface I'm doing. With relation to databases, I'm building a relational lang (thing like python but everything is a relation) so everything is in-memory, only need to worry for loading from external sources and implement a few core collections. – mamcx Feb 15 '19 at 03:42
  • It is certainly *not* literally impossible. Ineffective but definitely possible. – Pavel Šimerda Sep 25 '22 at 12:01
  • 3
    @PavelŠimerda kindly outline an algorithm that will sort “iterator […] 1 billion instances of 1 followed by a single 0” (feel free to reduce billion to hundred for testing purposes) without getting all the values. I’m wildly excited to learn how such a thing is possible. – Shepmaster Sep 25 '22 at 12:06