3

Given a string in the format {Length}.{Text} (such as 3.foo), I want to determine which string, from a finite list, the given string is. The reader starts at the 0-index and can seek forward (skipping characters if desired).

As an example, consider the following list:

10.disconnect
7.dispose
7.distort

The shortest way to determine which of those strings has been presented might look like:

if (reader.Current == "1")
{
  // the word is "disconnect"
}
else
{
  reader.MoveForward(5);
  if (reader.Current == "p")
  {
    // the word is "dispose"
  }
  else
  {
    // the word is "distort"
  }
}

The question has 2 parts, though I hope someone can just point me at the right algorithm or facet of information theory that I need to read more about.

1) Given a finite list of strings, what is the best way to generate logic that requires the least number of seeks & comparisons, on average, to determine which word was presented?

2) As with the first, but allowing weighting such that hotpaths can be accounted for. i.e. if the word "distort" is 4 times more likely than the words "disconnect" and "dispose", the logic shown above would be more performant on average if structured as:

reader.MoveForward(5);
if (reader.Current == "t")
{
  // the word is distort
}
else //...

Note: I'm aware that the 6th character in the example set is unique so all you need to do to solve the example set is switch on that character, but please assume there is a longer list of words.

Also, this isn't some homework assignment - I'm writing a parser/interception layer for the Guacamole protocol. I've looked at Binary Trees, Tries, Ulam's Game, and a few others, but none of those fit my requirements.

Smudge202
  • 4,689
  • 2
  • 26
  • 44
  • Is there a possibility to sort/order the given list? Or does it need to stay as it is. – Izoman Aug 19 '19 at 13:51
  • You can sort the list. However, the objective here is to design code (that designs code...?) based on rules derived from a given list. Hopefully that makes sense? I've found that if you remove the forward-only rule it's fairly easy to do this task (manually/by hand) by examining the distribution of characters at each index; the higher the distribution the more useful that index. Forward only makes it much more tricky though :-/ – Smudge202 Aug 19 '19 at 14:20
  • 1
    Your question is too broad. That said, I think you're also barking up the wrong tree. **1)** has no one ever written a parser for the protocol yet? It might make more sense to use existing art. **2)** Do you actually have a performance problem? String parsing should not task the computer very much, if at all. Surely I/O and other operations dominate compute cost? ... – Peter Duniho Aug 19 '19 at 15:40
  • 1
    ... **3)** Is "forward only" really the goal that's important here? For example, if the strings always take the form ".", and you have a finite selection of things they could be, I would expect a dictionary to be simpler and faster. **4)** if you insist on the "forward only" strategy, I think you're looking at a state machine, or some variation on that theme. – Peter Duniho Aug 19 '19 at 15:40
  • @PeterDuniho thanks for the answer. 1) Yup, there's a java version out there, I'd just like to do it "better". 2) I'm piecing together different implementations so that I can benchmark them against one another. The forward-only constraint is indeed arbitrary right now (using [SequenceReader](https://docs.microsoft.com/en-us/dotnet/api/system.buffers.sequencereader-1?view=netcore-3.0) which does offer a Rewind). Hopefully I can just micro-benchmark usage of sequence reader Rewinds to confirm the perf diff is non-existent/negligible. 4) Know a good way to generate/represent said state machine? – Smudge202 Aug 19 '19 at 15:57
  • _"Know a good way to generate/represent said state machine?"_ -- you have lots of options. As long as you're micro-optimizing, you might prefer an explicitly constructed state table and transition engine, but in my experience a simple dictionary-based transition table works fine (i.e. each state has a dictionary for input and next state, and those dictionaries are the values in a state-to-transition dictionary). Many years ago I wrote such an implementation and posted it on one of the now-defunct Microsoft C# newsgroups (probably microsoft.public.dotnet.csharp.general), but I can't find it now – Peter Duniho Aug 19 '19 at 16:25
  • @PeterDuniho yeh, if I understand correctly I'm using an explicitly constructed state table which I'm manually calculating (for a variant that doesn't force forward-only which is much easier as I can just look at rolling distributions to determine the logic). Micro-benchmarks show dictionaries as an order of magnitude slower than typed state (~6.65 ns vs ~0.73). Let me know if you find that article or something similar, please! :) – Smudge202 Aug 19 '19 at 16:34
  • 1
    If perf is important and you believe you have tests that demonstrate a significant difference between a dictionary-based approach and an explicitly-implemented state machine, then you should just use the latter. If you are curious about whether your dictionary-based approach could be improved you can post that to codereview.stackexchange.com. I don't plan to do any more searching for my post...archives of the old newsgroups seem to be incomplete, and there doesn't seem to be much point in digging it up anyway. – Peter Duniho Aug 19 '19 at 16:44

1 Answers1

1

I dont know if this would be of any help, but I'll throw my 5 cents in anyway.

What about a tree that automatically gets more granular as you have more strings in the list, and checking of the existing leaves are done with respect to "hotpaths"?

for example, I would have something like this with your list:

10.disconnect 7.dispose 7.distort

root ---- 7 "check 4th letter" ------ if "t" return "distort"
     |      "in the order of "   |
     |      "  hot paths     "    --- if "p"return "dispose"
     |
     ----10 ---- return "disconnect"

you can have this dynamically build up. for example if you add 7.display it would be

root ---- 7 "check 4th letter" ------ if "t" return "distort"
     |      "in the order of "   |
     |      "  hot paths     "    --- if "p" --- "check 5th letter" --- if "o" ...
     |                                                               |
     ----10 ---- return "disconnect"                                  --- if "l" ...

so nodes in the tree would have a variable "which index to check", and leaves corresponding to possible results (order is determined statistically). so something like:

# python example
class node():
  def __init__(which_index, letter):
    self.which_index = which_index # which index this node checks to determine next node
    self.letter = letter # for which letter we go to this node
    self.leaves = SomeLinkedList()

  def add_leaf(node):
    self.leaves.putInCorrectPositionDependingOnHowHotPathItIs(node)

  def get_next(some_string):
    for leaf in self.leaves:
      if some_string[self.which_index] == leaf.letter:
         return leaf
    raise Exception("not found")

another alternative is of course hashing.

But if you are micro-optimizing, it is hard to say as there are other factors that come into play (eg. probably time you save from memory caching would be very significant).

ozgeneral
  • 6,079
  • 2
  • 30
  • 45
  • I see where you're going with this, certainly that's essentially the mechanism I've been using to represent this while trying to brute force solutions to what order the tree should be in. What I was really after was a means to derive the index/letter/decisions. I've been able to manually calculate a forward only decision tree that requires I check at max 3 bytes, typically 2. But I had to do that by hand and I've no clue how best to factor in hotpaths. I suspect I didn't ask my question very well... :) – Smudge202 Aug 21 '19 at 08:03