6

I have a string that is about a thousand characters long composed of L's, T's, and A's. I'm pretty sure there is a simple pattern in it and I'm wondering if there is any quick and easy way of going about finding it. This string changes so that this is not just a one off.

The pattern I'm looking for is for example if the string was

LLLLLLLLAATAALLLLALLLLLLAATAALLLATLLLLLAATAALLAALLLLLAATAALL

The substring LLLLLAATAALL repeats 4 times in this string. I want to search for substrings like this but I don't know where they start, end, how many there are, and how long they are in the main string.

If there are any tools in Java for looking for this kind of thing any advice would be much appreciated.

nt

Peter Lang
  • 54,264
  • 27
  • 148
  • 161
nite
  • 763
  • 2
  • 9
  • 16
  • 1
    It is kind of an AI problem for me when you don't know the pattern and you want to find. I think that for a limited string, you can find a lot of patterns, the only problem is that how many will be correct in the next string – vodkhang Sep 08 '10 at 05:40
  • That's pretty much a standard computer science problem. See http://en.wikipedia.org/wiki/Longest_repeated_substring_problem and http://www.cs.duke.edu/csed/algoprobs/substring.html – joschi Sep 08 '10 at 06:18
  • This problem, although simple, really intrigues me! What kind of alien transmission did you intercept? (-; Are you just looking for repeated substrings? I can imagine other, more complex patterns as well. Anyway, I am looking into solutions right now and will post an answer if I find anything useful. – Adriaan Koster Sep 08 '10 at 09:06
  • @Adrian lol, unfortunately no aliens. I'm thinking of several different ways to search for a repeating pattern in 2d graphics files and this is my first attemp. – nite Sep 08 '10 at 22:59

2 Answers2

4

Ok, so I took the code from here and adapted it to keep track of and print the longest repeated substring. Just run it using JUnit.

/* Copyright (c) 2010 the authors listed at the following URL, and/or
 the authors of referenced articles or incorporated external code:
 http://en.literateprograms.org/Suffix_tree_(Java)?action=history&offset=20100123220137

 Permission is hereby granted, free of charge, to any person obtaining
 a copy of this software and associated documentation files (the
 "Software"), to deal in the Software without restriction, including
 without limitation the rights to use, copy, modify, merge, publish,
 distribute, sublicense, and/or sell copies of the Software, and to
 permit persons to whom the Software is furnished to do so, subject to
 the following conditions:

 The above copyright notice and this permission notice shall be
 included in all copies or substantial portions of the Software.

 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
 IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
 CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
 TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

 Retrieved from: http://en.literateprograms.org/Suffix_tree_(Java)?oldid=16641
 */

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import org.junit.Test;

public class SuffixTree {
    @Test
    public void sampleUsage() {

        AbstractSuffixTree tree = new SimpleSuffixTree(
                "LLLLLLLLAATAALLLLALLLLLLAATAALLLATLLLLLAATAALLAALLLLLAATAALL");
        System.out.println("Longest repeating substring "
                + tree.best.printResult() + " repetitions=" + tree.best.visits
                + " length=" + tree.best.stringDepth);
    }
}

abstract class AbstractSuffixTree {

    SuffixTreeNode best;
    String text = null;
    SuffixTreeNode root = null;
    int inputAlphabetSize = -1;

    AbstractSuffixTree(String text) {
        if (text.length() > 0 && text.charAt(text.length() - 1) == '$') {
            this.text = text;
        } else {
            this.text = text + "$";
        }
    }

}

class SimpleSuffixTree extends AbstractSuffixTree {

    public SimpleSuffixTree(String text) {
        super(text);
        constructTree();
    }

    private void constructTree() {
        super.root = new SuffixTreeNode(this);
        best = root;
        char[] s = super.text.toCharArray();
        for (int i = 0; i < s.length; i++) {
            List<String> suffixList = new ArrayList<String>();
            for (int k = i; k < s.length; k++) {
                suffixList.add(s[k] + "");
            }
            super.root.addSuffix(suffixList, i + 1);
        }
    }

}

class CompactSuffixTree extends AbstractSuffixTree {

    public CompactSuffixTree(SimpleSuffixTree simpleSuffixTree) {
        super(simpleSuffixTree.text);
        super.root = compactNodes(simpleSuffixTree.root, 0);
        super.best = simpleSuffixTree.best;
    }

    private SuffixTreeNode compactNodes(SuffixTreeNode node, int nodeDepth) {
        node.nodeDepth = nodeDepth;
        for (SuffixTreeNode child : node.children) {
            while (child.children.size() == 1) {
                SuffixTreeNode grandchild = child.children.iterator().next();
                child.incomingEdge.label += ", "
                        + grandchild.incomingEdge.label;
                child.stringDepth += grandchild.incomingEdge.label.length();
                child.children = grandchild.children;
                // for (SuffixTreeNode grandchild : child.children)
                grandchild.parent = node;
            }
            child = compactNodes(child, nodeDepth + 1);
        }
        return node;
    }

}

class SuffixTreeNode {

    AbstractSuffixTree tree;
    SuffixTreeEdge incomingEdge = null;
    int nodeDepth = -1;
    int label = -1;
    Collection<SuffixTreeNode> children = null;
    SuffixTreeNode parent = null;
    int stringDepth;
    int id = 0;
    public static int c;
    public int visits = 1;

    public SuffixTreeNode(AbstractSuffixTree tree, SuffixTreeNode parent,
            String incomingLabel, int depth, int label, int id) {
        children = new ArrayList<SuffixTreeNode>();
        incomingEdge = new SuffixTreeEdge(incomingLabel, label);
        nodeDepth = depth;
        this.label = label;
        this.parent = parent;
        stringDepth = parent.stringDepth + incomingLabel.length();
        this.id = id;
        this.tree = tree;
    }

    public SuffixTreeNode(AbstractSuffixTree tree) {
        children = new ArrayList<SuffixTreeNode>();
        nodeDepth = 0;
        label = 0;
        this.tree = tree;
    }

    public void addSuffix(List<String> suffix, int pathIndex) {
        SuffixTreeNode insertAt = this;
        insertAt = search(this, suffix);
        insert(insertAt, suffix, pathIndex);
    }

    private SuffixTreeNode search(SuffixTreeNode startNode, List<String> suffix) {
        if (suffix.isEmpty()) {
            throw new IllegalArgumentException(
                    "Empty suffix. Probably no valid simple suffix tree exists for the input.");
        }
        Collection<SuffixTreeNode> children = startNode.children;
        for (SuffixTreeNode child : children) {
            if (child.incomingEdge.label.equals(suffix.get(0))) {
                suffix.remove(0);
                child.visits++;
                if (child.visits > 1
                        && child.stringDepth > tree.best.stringDepth) {
                    tree.best = child;
                }
                if (suffix.isEmpty()) {
                    return child;
                }
                return search(child, suffix);
            }
        }
        return startNode;
    }

    private void insert(SuffixTreeNode insertAt, List<String> suffix,
            int pathIndex) {
        for (String x : suffix) {
            SuffixTreeNode child = new SuffixTreeNode(tree, insertAt, x,
                    insertAt.nodeDepth + 1, pathIndex, id);
            insertAt.children.add(child);
            insertAt = child;
        }
    }

    public String toString() {
        StringBuilder result = new StringBuilder();
        String incomingLabel = this.isRoot() ? "" : this.incomingEdge.label;
        for (int i = 1; i <= this.nodeDepth; i++)
            result.append("\t");
        if (this.isRoot()) {
            c = 1;
            this.id = 1;
        } else {
            this.id = c;
            result.append(this.parent.id + " -> ");
            result.append(this.id + "[label=\"" + incomingLabel + "\"]" + "("
                    + visits + "," + (stringDepth) + ")" + ";\n");
        }
        for (SuffixTreeNode child : children) {
            c++;
            child.id = c;
            result.append(child.toString());
        }
        return result.toString();
    }

    public String printResult() {
        if (parent == null) {
            return "";
        } else {
            return this.parent.printResult() + this.incomingEdge.label;
        }
    }

    public boolean isRoot() {
        return this.parent == null;
    }

    public boolean isLeaf() {
        return children.size() == 0;
    }

}

class SuffixTreeEdge {

    String label = null;
    @SuppressWarnings("unused")
    private int branchIndex = -1;

    public SuffixTreeEdge(String label, int branchIndex) {
        this.label = label;
        this.branchIndex = branchIndex;
    }

}

Output:

Longest repeating substring LLLLLLAATAALLL repetitions=2 length=14

EDIT: response to your comments.

Currently I simply keep track of the longest repeating SuffixTreeNode (it's a field in AbstractSuffixTree). You could modify this so it keeps track of a SortedQueue of nodes, ordered by their stringDepth.

Adriaan Koster
  • 15,870
  • 5
  • 45
  • 60
  • thx a lot. this almost does what im looking for. the only thing is, is that im not looking for the longest repeating substring. when i used it on an actual teststring with 1000 characters, it returned something like: "LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL" – nite Sep 08 '10 at 23:01
  • 1
    its possible that the string im lookin for is the longest, and also it might not be. i havent taken a deep look at the code you provided yet, but is there a way to look for the 2nd longest, 3rd longest, etc.? – nite Sep 08 '10 at 23:02
-1

You will get idea about how to do it from the following code to count the String "hi";

public int countHi(String str) {

int count = 0,i = str.length() - 1; String goal = "hi";

for(int j = 0;j < i;j++) { if(str.substring(j,j+2).equals(goal)) count++;} return count; }

Jinjavacoder
  • 267
  • 1
  • 4
  • 13
  • 2
    Hi Jinjavajin, I undestand how to search for a sunstring in a string. The problem is I don't know what substring I'm searching for. Therefore something like specifying a 'goal' will only work once I know what the pattern is. – nite Sep 08 '10 at 06:59
  • 2
    your code snippet shows terrible brute force which is not good idea when you have thousands of characters and unknown goal – Gaim Sep 08 '10 at 07:24