I'm using a Rope to store a large amount (GB's) of text. The text can be tens of millions of lines long.
The rope itself is extremely fast inserting at any position, and is also fast getting a character at a specific position.
However, how would I get where a specific line (\n
for this case) starts? For example, how would I get where line 15 starts? There are a couple options that I can see.
- Don't have any extra data. Whenever you want say the 15th line, you iterate through all the characters in the
Rope
, find the newlines, and when you reach the 15th newline, then you stop. - Store the
start
andlength
of each line in a vector. So you would have yourRope
data structure containing all the characters, and then a separatestd::vector<line>
. Theline
structure would just consist of 2 fields;start
andlength
. Start represents where the line starts inside of theRope
, and length is the length of the line. To get where the 15th line starts, just dolines[14].start
Problems:
#1 is a horrible way to do it. It's extremely slow because you have to go through all of the characters.
#2 is also not good. Although getting where a line starts is extremely fast (O(1)
), every time you insert a line, you have to move all the lines ahead of it, which is O(N)
. Also, storing this means that for every line you have, it takes up an extra 16 bytes of data. (assuming start
and length
are 8 bytes each). That means if you have 13,000,000 lines, it would take up 200MB of extra memory. You could use a linked list, but it just makes the access slow.
Is there any better & more efficient way of storing the line positions for quick access & insert? (Preferably O(log(n))
for inserting & accessing lines)
I was thinking of using a BST
, and more specifically a RB-Tree, but I'm not entirely sure how that would work with this. I saw VSCode
do this but with a PieceTable
instead.
Any help would be greatly appreciated.
EDIT:
The answer that @interjay provided seems good, but how would I handle CRLF if the CR and LF were split between 2 leaf nodes?
I also noticed ropey, which is a rust library for the Rope
. I was wondering if there was something similar but for C++
.