2

I'm implementing a purpose-built regular expression engine using finite automata. I will have to store thousands of states, each state with its own transition table from unicode code points (or UTF-16 code units; I haven't decided) to state IDs.

In many cases, the table will be extremely sparse, but in other cases it will be nearly full. In most cases, most of the entries will fall into several contiguous ranges with the same value.

The simplest implementation would be a lookup table, but each such table would take up a great deal of space. A list of (range, value) pairs would be much smaller, but slower. A binary search tree would be faster than a list.

Is there a better approach, perhaps leveraging built-in functionality?

Thom Smith
  • 13,916
  • 6
  • 45
  • 91
  • Since (modern) javascript arrays are themselves _sparse_ that shouldn't be a memory problem http://stackoverflow.com/questions/4524067/if-i-set-only-a-high-index-in-an-array-does-it-waste-memory ; Many array elements can also reference the same object, which would give you _some_ space savings when encountering a range. – Stephen P May 02 '17 at 19:22
  • @StephenP The question is about mapping to integers though. You'd have to wrap the integers in order to be able to reference them (because they are a primitive type), so referencing would in fact be slower. – Aurel Bílý May 02 '17 at 19:53
  • 1
    Point taken @Aurel – Stephen P May 02 '17 at 21:24

2 Answers2

1

Sounds like you have two very different cases ("in many cases, the table will be extremely sparse, but in other cases it will be nearly full").

For the sparse case, you could possibly have a separate sparse index (or several layers of indexes), then your actual data could be stored in a typed array. Because the index(es) would be mapping from integers to integers, they could be represented as typed arrays as well.

Looking up a value would look like this:

  1. Binary search the index. The index stores pairs as consecutive entries in the typed array – the first element is the search value, the second is the position in the data set (or the next index).
  2. If you have multiple indexes, repeat 1 as necessary.
  3. Start iterating your dataset at the position given by the last index. Because the index is sparse, this position might not be the one where the value is stored, but it is a good starting point as the correct value is guaranteed to be nearby.
  4. The dataset itself is represented as a typed array where consecutive pairs hold the key and the value.

I cannot think of anything better to use in JavaScript. Typed arrays are pretty fast and having indexes should increase the speed drastically. That being said, if you only have a couple thousand entries, don't bother with indexes, do a binary search directly on the typed array (described in 4. above).

For the dense case, I am not sure. If the dense case happens to be a case where repeated values across ranges of keys are likely, consider using something like run-length encoding – identical consecutive values are represented simply as their number of occurrences and then the actual value. Once again, use typed arrays and binary search, possibly even indexes to make this faster.

Aurel Bílý
  • 7,068
  • 1
  • 21
  • 34
1

Unfortunately, JavaScript's built-in data-types - especially Map - are not of great help in accomplishing this task, as they lack the relevant methods.

In most cases, most of the entries will fall into several contiguous ranges with the same value.

We can however exploit this and use a binary search strategy on sorted arrays, assuming the transition tables won't be modified often.

Encode contiguous input ranges leading to the same state by storing each input range's lowest value in a sorted array. Keep the states at corresponding indices in a separate array:

let inputs = [0, 5, 10]; // Input ranges [0,4], [5,9], [10,∞)
let states = [0, 1, 0 ]; // Inputs [0,4] lead to state 0, [5,9] to 1, [10,∞) to 0

Now, given an input, you need to perform a binary search on the inputs array similar to Java's floorEntry(k):

// Returns the index of the greatest element less than or equal to
// the given element, or undefined if there is no such element:
function floorIndex(sorted, element) {
  let low = 0;
  let high = sorted.length - 1;
  while (low <= high) {
    let mid = low + high >> 1;
    if (sorted[mid] > element) {
      high = mid - 1;
    } else if (sorted[mid] < element) {
      low = mid + 1;
    } else {
      return mid
    }
  }
  return low - 1;
}

// Example: Transition to 1 for emoticons in range 1F600 - 1F64F:
let transitions = {
  inputs: [0x00000, 0x1F600, 0x1F650],
  states: [0,       1,       0      ]
};
let input = 0x1F60B; // 
let next = transitions.states[floorIndex(transitions.inputs, input)];

console.log(`transition to ${next}`);

This search completes in O(log n) steps where n is the number of contiguous input ranges. The transition table for a single state then has a space requirement of O(n). This approach works equally well for sparse and dense transition tables as long as our initial assumption - the number of contiguous input ranges leading to the same state is small - holds.

le_m
  • 19,302
  • 9
  • 64
  • 74