Modeling String Sequences with a Forest of Trees

Question

I am having some difficulty writing a function in python for analyzing sequences of strings found in a list of strings. This function will take as input an integer of n and an ordered list of strings, and will later output a forest of trees representing unique sequences of strings of length n (except for perhaps the last sequence).

I am not quite sure how to approach implementing this function. Any advice or resources I could refer to would be much appreciated.

Edit:

Consider the following example

strings = ['Hello', 'Tim', 'Fish', 'Fish', 'Hello', 'Tim', 'Fish']

Then the build_forest(strings, 3) would produce a forest following structure:

Hello 
  | ___ Tim ___ Fish

 Tim
  | ___ Fish ___ Fish

Fish
  | ___ Fish ___ Hello
  | ___ Hello ___ Tim

To get an answer that really helps you and others, please provide a more concrete example of the inputs and what you've tried so far yourself in a code example, as well as an example or a clear description of the expected output matching the example inputs. Then focus your question on problems in your solution that you need help with. If you really have no clue about the kind of code required, at least provide the setup - i.e. a representation in Python of some example input data and a clear description of the matching output. — Grismar, Feb 07 '19 at 02:27

score 1 · Accepted Answer · answered Feb 07 '19 at 02:53

You can represent this using trie, or prefix tree. Using a modified version of this answer and the rolling window iterator, you can say:

from itertools import islice

def build_trie(paths):
    head = {}
    for path in paths:
        curr = head
        for item in path:
            if item not in curr:
                curr[item] = {}
            curr = curr[item]
    return head

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

from pprint import pprint

pprint(build_trie(window(strings, 3)))

prints

{'Fish': {'Fish': {'Hello': {}}, 
          'Hello': {'Tim': {}}},
 'Hello': {'Tim': {'Fish': {}}},
 'Tim': {'Fish': {'Fish': {}}}}

score 1 · Answer 2 · answered Feb 07 '19 at 03:02

This is pretty similar to making a markov model, except you have multiple branches for the following n-1 possible sequences, and you do not take probability into account.

Do you have any specific class in mind for how the tree could be represented?

A simple solution could involve something like:

class TreeNode:
   def __init___(string):
      self.string = string
      self.children = {}

   def is_child(child_name):
      return child_name in self.children

   def add_child(child_name):
      new_child = TreeNode(child_name)
      self.children[child_name] = new_child
      return new_child

   def get_child(child_name):
      return self.children[child_name]


def make_tree(string_seq, n)
   trees = {}
   for idx in range(len(string_seq) - n):
      # For each possible starts to a tree, check if any trees
      # have begun with that string, and if so add to that tree,
      # otherwise, make a new one.
      tree_position = None
      if string_seq[idx] not in trees:
         tree_position = TreeNode(string[idx])
         trees[string_seq[idx]] = tree_position
      else:
         tree_position = trees[string_seq[idx]]

      # Continue making new branches for any new strings that appear.
      for offset in range(1, n - 1):
         if not tree_position.is_child(string_seq[idx + offset]):
            tree_position.add_child(string_set[idx + offset])
         tree_position = tree_position.get_child(string_set[idx + offset])
   return trees

score 0 · Answer 3 · answered Feb 07 '19 at 02:47

From the example data, another way to describe the problem is this:

given a sequence of n strings,
for all sub-sequences of exactly length m (m < n),
generate a tree data structure that efficiently stores these sub-sequences,
so that the first element of a sub-sequence is at the top level,
the second element is at the first level underneath that, etc.,
and no node is every repeated under a specific parent node

A suitable data structure would be a dictionary, which would look like:

{
    'Hello': {
        'Tim': {
            'Fish': {}
        }
    },
    'Tim': {
        'Fish': {
            'Fish': {}
        }
    },
    'Fish': {
        'Fish': {
            'Hello': {}
        },
        'Hello': {
            'Tim': {}
        }
    },

Turning that into code:

example = ['Hello', 'Tim', 'Fish', 'Fish', 'Hello', 'Tim', 'Fish']


def build_forest(strings, sequence_length):
    assert sequence_length < len(strings)
    # start with an empty dictionary
    result = {}
    # iterate over all sub-sequences of the given length
    for sequence in [strings[i:i + sequence_length] for i in range(len(strings) + 1 - sequence_length)]:
        # keep track of the dictionary at the correct level we're looking at
        d = result
        # try to get all the keys of the sequence in, in order
        for key in sequence:
            # if it wasn't at the current level, add a new dictionary
            if key not in d:
                d[key] = {}
            # start looking at the next level (either new or old)
            d = d[key]
    # at the end, return the constructed dictionary
    return result


print(build_forest(example, 3))

Modeling String Sequences with a Forest of Trees

3 Answers3