11

I'm looking for the best data structure to add styles to a text (say in a text editor). The structure should allow the following operations:

  1. Quick lookup of all styles at absolute position X
  2. Quick insert of text at any position (styles after that position must be moved).
  3. Every position of the text must support an arbitrary number of styles (overlapping).

I've considered lists/arrays which contain text ranges but they don't allow quick insert without recalculating the positions of all styles after the insert point.

A tree structure with relative offsets supports #2 but the tree will degenerate fast when I add lots of styles to the text.

Any other options?

Jonas
  • 121,568
  • 97
  • 310
  • 388
Aaron Digulla
  • 321,842
  • 108
  • 597
  • 820
  • Have you decided how is the text itself going to be stored? Whatever structure the text uses has to efficiently handle insertions/deletions, so it might be possible to expand on that by having the text point to the styles rather than the other way around. Something like accompanying each character with a pointer to an array/list of applicable styles. You should be able to share the styles and the array among characters, and you might event be able to share the pointers themselves. – thkala Nov 15 '10 at 17:03
  • @thkala: Please post that as an answer so I can comment. – Aaron Digulla Nov 16 '10 at 14:35

1 Answers1

4

I have never developped an editor, but how about this:

I believe it would be possible to expand the scheme that is used to store the text characters themeselves, depending of course on the details of your implementation (language, toolkits etc) and your performance and resource usage requirements.

Rather than use a separate data structure for the styles, I'd prefer having a reference that would accompany each character and point to an array or list with the applicable characters. Characters with the same set of styles could point to the same array or list, so that one could be shared.

Character insertions and deletions would not affect the styles themeselves, apart from changing the number of references to them, which could be handled with a bit of reference counting.

Depending on your programming language you could even compress things a bit more by pointing halfway into a list, although the additional bookkeeping for this might in fact make it more inefficient.

The main issue with this suggestion is the memory usage. In an ASCII editor written in C, bundling a pointer with each char would raise its effective memory usage from 1 byte to 12 bytes on a 64 bit system, due to struct alignment padding.

I would look about breaking the text into small variable size blocks that would allow you to efficiently compress the pointers. E.g. a 32-character block might look like this in C:

struct _BLK_ {
    unsigned char size;
    unsigned int styles;
    char content[];
}

The interesting part is the metadata processing on the variable part of the struct, which contains both the stored text and any style pointers. The size element would indicate the number of characters. The styles integer (hence the 32-character limit) would be seen as a set of 32 1-bit fields, with each one indicating whether a character has its own style pointer, or whether it should use the same style as the previous character. This way a 32-char block with a single style would only have the additional overhead of the size char, the styles mask and a single pointer, along with any padding bytes. Inserting and deleting characters into a small array like this should be quite fast.

As for the text storage itself, a tree sounds like a good idea. Perhaps a binary tree where each node value would be the sum of the children values, with the leaf nodes eventually pointing to text blocks with their size as their node value? The root node value would be the total size of the text, with each subtree ideally holding half of your text. You'd still have to auto-balance it, though, with sometimes having to merge half-empty text blocks.

And in case you missed it, I am no expert in trees :-)

EDIT:

Apparently what I suggested is a modified version of this data structure:

http://en.wikipedia.org/wiki/Rope_%28computer_science%29

as referenced in this post:

Data structure for text editor

EDIT 2:

Deletion in the proposed data structure should be relatively fast, as it would come down to byte shifting in an array and a few bitwise operations on the styles mask. Insertion is pretty much the same, unless a block fills up. It might make sense to reserve some space (i.e. some bits in the styles mask) within each block to allow for future insertions directly in the blocks, without having to alter the tree itself for relatively small amounts of new text.

Another advantage of bundling characters and styles in blocks like this is that its inherent data locality should allow for more efficient use of the CPU cache than other alternatives, thus improving the processing speed to some extent.

Much like any complex data structure, though, you'd probably need either profiling with representative test cases or an adaptive algorithm to determine the optimal parameters for its operation (block size, any reserved space etc).

Community
  • 1
  • 1
thkala
  • 84,049
  • 23
  • 157
  • 201
  • +1 for the idea to compress the styles. I've seen the rope structure and it's nice until you need to save styles. – Aaron Digulla Nov 18 '10 at 15:58
  • The main problem with styles is that you need to be able to efficiently answer this question: Which styles are active at position X in the text? Also the style management should be simple when text is inserted/deleted (splitting styles, merging them, removing them). Most editors use a tree structure (like an interval tree: http://en.wikipedia.org/wiki/Interval_tree) but if you add a character somewhere, you need to recalculate all intervals towards the end of the text. – Aaron Digulla Nov 18 '10 at 16:03
  • @Aaron Digulla: Yes, keeping the styles in sync with the text can be messy if they are not bundled together. On the other hand methods similar to the one I proposed need (much?) more memory to store text with relatively few style changes, but they are fast to access and modify and they may even become more memory efficient as the text styling becomes more complex. – thkala Nov 18 '10 at 21:54