4

I'm using a Rope to store a large amount (GB's) of text. The text can be tens of millions of lines long.

The rope itself is extremely fast inserting at any position, and is also fast getting a character at a specific position.

However, how would I get where a specific line (\n for this case) starts? For example, how would I get where line 15 starts? There are a couple options that I can see.

  1. Don't have any extra data. Whenever you want say the 15th line, you iterate through all the characters in the Rope, find the newlines, and when you reach the 15th newline, then you stop.
  2. Store the start and length of each line in a vector. So you would have your Rope data structure containing all the characters, and then a separate std::vector<line>. The line structure would just consist of 2 fields; start and length. Start represents where the line starts inside of the Rope, and length is the length of the line. To get where the 15th line starts, just do lines[14].start

Problems:

#1 is a horrible way to do it. It's extremely slow because you have to go through all of the characters.

#2 is also not good. Although getting where a line starts is extremely fast (O(1)), every time you insert a line, you have to move all the lines ahead of it, which is O(N). Also, storing this means that for every line you have, it takes up an extra 16 bytes of data. (assuming start and length are 8 bytes each). That means if you have 13,000,000 lines, it would take up 200MB of extra memory. You could use a linked list, but it just makes the access slow.

Is there any better & more efficient way of storing the line positions for quick access & insert? (Preferably O(log(n)) for inserting & accessing lines)

I was thinking of using a BST, and more specifically a RB-Tree, but I'm not entirely sure how that would work with this. I saw VSCode do this but with a PieceTable instead.

Any help would be greatly appreciated.

EDIT:

The answer that @interjay provided seems good, but how would I handle CRLF if the CR and LF were split between 2 leaf nodes?

I also noticed ropey, which is a rust library for the Rope. I was wondering if there was something similar but for C++.

WowThere
  • 51
  • 3
  • Did you consider storing the data as a vector of strings representing lines? – Stefan Haustein Mar 31 '21 at 22:14
  • @StefanHaustein That wouldn't have a rope's O(logn) complexity for operations such as concatenation, substring, insertion, etc. They would all take linear time instead. – interjay Mar 31 '21 at 22:16

1 Answers1

2

In each rope node (both leaves and internal nodes), in addition to holding the number of characters in that subtree, you can also put the total number of newlines contained in the subtree.

Then finding a specific newline will work exactly the same way as finding the node holding a specific character index. You would look at the "number of newlines" field instead of the "number of characters" field.

All rope operations will work mostly the same. When creating a new internal node, you just need to add its children's number of newlines. Complexity of all operations is the same.

interjay
  • 107,303
  • 21
  • 270
  • 254
  • Thanks for the answer! I'm gonna try this and report back – WowThere Mar 31 '21 at 22:21
  • 1
    I'm curious, if the lines were CRLF, and there were 2 leaf nodes, and one of the leaf nodes contains a CR and the other one a LF, how would I know it's only 1 line and not 2 separate lines? – WowThere Apr 01 '21 at 19:47
  • @WowThere You can count only the LF and ignore the CRs. – interjay Apr 01 '21 at 21:55
  • But what if I were counting the CR, LF, and CRLF? – WowThere Apr 01 '21 at 21:57
  • @WowThere I don't see the benefit in doing that. Just the LFs are enough to tell you where lines start. – interjay Apr 01 '21 at 22:23
  • If for some reason you do need to support files with a mix of all 3 of CR/LF/CRLF (e.g. technically the Language Server Protocol does, though VS Code itself dodges the issue by normalizing line breaks in its internal representation, making it impossible to edit that sort of file) then things get slightly trickier: you need to track both the number of line breaks and whether each subtree starts with LF or ends with CR, and combine them appropriately. – rpjohnst Dec 31 '22 at 23:32