1

I have recently started learning lucene and came to know about how lucene stores and queries indices. Lucene seems to be using skip list as an underlying data structure. However, I did not find any reason to use skip list over a binary tree.

The advantage with skip lists is that it provides good performance when being used concurrently. And lucene allows single writer thread per index and readers read from immutable segments, so skip list is not helping here either. Other than that binary tree (self balancing) trumps skip list - since it provides worst case complexity of O(logn) for reading and writing whereas skip list provides same time complexity in average case. Also, binary tree would serve range queries in better time compared to skip list. For serving a conjunction query as well, lucene uses skip lists of multiple postings list to find their intersection - for this case too binary tree would have been enough.

Is there any specific reason skip list is used in lucene for indexing purposes that I have missed?

RonC
  • 31,330
  • 19
  • 94
  • 139

1 Answers1

2

Lucene builds an inverted index using Skip-Lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). See this SO answer for How does lucene index documents?

In that answer, it also indicates that the primary benefit of using Skip-Lists it that it avoids ever having to rebalance a B-Tree. If you'd like to dig deeper that answer cite another one that provides a lot more detail: Skip List vs. Binary Search Tree Which intern references additional whitepapers.

Researching this a bit more, there is one other advantage to using Skip-Lists rather then a BTree. It's not just the rebalancing that is avoided, but also avoided is the locking of a portion of the tree while the rebalancing is taking place. This aspect is discussed further here. This latter advantage improves concurrency.

RonC
  • 31,330
  • 19
  • 94
  • 139
  • So, Lucene uses deterministic skip list in that it will elevate every kth element from current list (this is the invariant). Deterministic skip list provides worst case guarantees that balanced tree provides. But, it also adds complexity and computation time while inserting new keys because skip list needs to "rebalance" (same as balanced tree) on insertion to maintain the invariant. But, since lucene documents have strictly increasing doc ids in a single segment no rebalancing of old nodes on any level is required. This is the advantage over balanced tree that you are referring right? – Rahul Patel Mar 31 '21 at 11:55
  • Yep, that is my understanding. But I am in no way an authority on this topic, only a developer that reads a lot with regard to Lucene. – RonC Mar 31 '21 at 11:58
  • Got it. This makes sense. – Rahul Patel Mar 31 '21 at 12:23
  • 1
    Regarding your edit. Yup, skip list definitely helps in concurrent environment. But, each lucene segment can be either written (lucene allows single writer per index) or read (could be multiple readers. And even for near real time use cases, lucene first flushes the buffered segment and collects further writes in new segment which is not available to query) but not both. So, concurrency would not be of any importance for Lucene. I wanted to understand choice of data structure in the context of lucene – Rahul Patel Apr 01 '21 at 04:11
  • 1
    I studied further and found out few more reasons for skip list being better choice. Lucene query execution is designed using abstraction in the form of iterator. This iterator needs to support two kinds of operation : 1) skip to next higher doc id matching the query 2) given doc_id skip to ceil(doc_id) that matches the query. While, BST can efficiently support second kind of operation efficiently, it cannot support first kind of operation efficiently. Thats where linked list (and its extension skip list) shines. Also, skip list is implemented using array which is better in terms of locality – Rahul Patel Apr 23 '21 at 17:07