1

I have a stream of chars that I need to keep in a big data structure (can contain billions of chars)

I need to be able to:

  1. store these chars quickly.
  2. get all the chars quickly in order to print them for example
  3. Delete a range of chars without leaving any gaps in the memory.

my first thought was double linked list , but the problem is that is taking to long to get to the middle of the list (begnining of the range)in order to delete.

to solve that I was thinking about a skip list which will make the search of this range faster but then I'm facing the problem of having to re-index each node after deletion

([0,1,2,3,4,5,6,7]

=> delete (3,4)

=> [0,1,2,5,6,7]

=> delete (3,4)

=> [0,1,2,7]

in this example after the first delete I need to give numbers 5,6,7 new indexes )

what is the best way to do this ?

  • "best" is not defined, so I don't think this question is objectively answerable. We need to know your specific requirements and evaluation metrics, so we can tell whether any candidate solution will be acceptable without having to guess what you have in mind. Also, it would help to specify what operations you want to perform on this data structure. Presumably if you have billions of characters stored then there will be limits on how quickly you can get them all. – D.W. Jan 04 '23 at 07:52
  • Can you explain what you mean by "without leaving any gaps" and also why you need to "give numbers 5,6,7 new indexes"? – jbapple Jan 05 '23 at 00:21

1 Answers1

0

It might be helpful to read about the span<T> data structure.

Related Answer: What is a "span" and when should I use one?

A span<T> is:

  • A very lightweight abstraction of a contiguous sequence of values of type T somewhere in memory.
  • Basically a struct { T * ptr; std::size_t length; } with a bunch of convenience methods.
  • A non-owning type (i.e. a "reference-type" rather than a "value type"): It never allocates nor deallocates anything and does not keep smart pointers alive.

I would add that if you are processing a stream of characters, you will probably want to use buffering (or perhaps more apt - "chunking") where each chunk is itself a span<char> of fixed-size (which are all stored in a separate bit of memory) but tracked in a central array (or a more complex data structure like a double-linked-list, to facilitate quick deletion)

It would be an anti-pattern to attempt to actually maintain your entire stream of data in a single piece of contiguous physical memory (which you seem to suggest in part 3 of your request) - especially if you plan on deleting chunks of it. There should other ways to facilitate fast deletion without sacrificing performance elsewhere.

For example if you wish to delete a range of characters that falls into a given span, you can create two new spans from the start and end of the original span, excluding the deleted characters, and then replace the original span instance in your larger data structure (e.g if it were a double-linked list) with the two new smaller spans. None of this requires copying the underlying data itself, just slicing up our lightweight references to the underlying data.

If your language of choice doesn't support span, or a similar structure, check out how span is implemented.

Depending on your language of choice, it may even have built-in support for streaming spans (as .NET Core 2.1+ (2018) does).

Any additional requirements (such fast indexing to any point in your data stream, net of any deletions) can be satisfied by maintaining separate data structures that carry metadata about your spans (such as the suggested linked list). They will need updating when spans are deleted or added to, but because spans are a thin layer on top of large strings of characters, they reduce the cardinality of data structures you are maintaining by several orders of magnitude, so while you could get fancy with maintaining a variety of heaps and maps to facilitate O(1) algorithms for every operation, you will probably find that basic structures and O(log(n)) or even O(N) (where N is actually N/chunk-size) maintenance operations are feasible.

Alain
  • 26,663
  • 20
  • 114
  • 184