0

I am implementing a suffix trie (this is different from a suffix tree) that stores the characters suffixes of strings as nodes in a tree structure where a string is made up by following traversing the tree until you hit a '$' or you hit the end of your search.

The problem is that constructing this trie consumes more memory than Java has when using a large text file. Are there any placed that I could cut down on memory use in terms of data structures? This is homework and it isn't a requirement to make it a compressed suffix trie (which is basically a suffix tree).

This is the basic structure that I currently have (I can provide the implementation details if you really want):

// SuffixTrie.java

public class SuffixTrie {
    private SuffixTrieNode root = new SuffixTrieNode();

    // implementation of insertions into tree etc..


    public static void main(String[] args) throws FileNotFoundException {   
        String fileName = "Frankenstein.txt";
        SuffixTrie st = readInFromFile(fileName);
        String[] ss = {"without","hideous", "the only", "onster", ", the", "ngeuhhh"};
        for (String s: ss) {
            SuffixTrieNode sn = st.get(s);
            System.out.println("[" + s + "]: " + sn);
        }
    }
}

Each node is:

// SuffixTrieNode.java
public class SuffixTrieNode {
    private char label; // Indicates the letter for this node
    private boolean isTerminal = false;
    private SuffixTrieData data;
    private HashSet<SuffixTrieNode> children; 
 // Inserting adds more SuffixTrieNodes to the children of the node

The data held in each node is:

public class SuffixTrieData {
    private ArrayList<Pair> startIndexes = new ArrayList<Pair>();

    public SuffixTrieData(int sentence, int index){
        addStartIndex(sentence, index);
    }   
    public class Pair{
        public int sentence;
        public int index;
        public Pair(int sentence, int index){
            this.sentence = sentence;
            this.index = index;
        }
    }
}

The error I get is:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.ArrayList.<init>(Unknown Source)
    at java.util.ArrayList.<init>(Unknown Source)
    at SuffixTrieData.<init>(SuffixTrieData.java:7)
    at SuffixTrie.insert(SuffixTrie.java:20)
    at SuffixTrie.insert(SuffixTrie.java:11)
    at SuffixTrie.readInFromFile(SuffixTrie.java:77)
    at SuffixTrie.main(SuffixTrie.java:89)

It works fine for smaller text files though and this is the first time they have given students this assignment, so the instructors don't know if this is doable with a suffix trie anyway..

Jonno_FTW
  • 8,601
  • 7
  • 58
  • 90
  • I am sure it is doable provided you have enough memory. If the file is too much data for the amount of memory you have, you need to use a more memory efficient data structure. – Peter Lawrey Sep 04 '11 at 07:25
  • @Peter we have to use a suffix trie, it's part of the assignment – Jonno_FTW Sep 04 '11 at 08:54
  • 1
    The simplest thing you can do to reduce memory is to use `private List startIndexes = new ArrayList(1);` similarly you can reduce the initial capacity of the Set. – Peter Lawrey Sep 04 '11 at 09:59
  • Try a compressed suffix tree. See my reply on http://stackoverflow.com/questions/8300364/store-retrieve-a-data-structure/8529333#8529333 – Adrian Dec 16 '11 at 16:06

2 Answers2

1

A suffix trie is going to use a lot of space just for the words (letters) alone. In addition it appears you are storing an array of every sentence the word appears in with an index (the code you post is incomplete, correct me if I'm wrong). If the file is fairly large ... that's going to take up some space.

One thing you could do is compress the sentences when storing, and decompress when retrieving them using deflate/inflate.

Aside from that, you probably want to increase the heap size for the JVM when you run the process, using the -Xmx option (e.g. java -Xmx 2GB -jar myJarFile.jar).

Brian Roach
  • 76,169
  • 12
  • 136
  • 161
  • It takes the suffixes of each sentence. It would be a lot simpler to store each word in a node, but the spec requires that we need to be able to search for partial words, eg. 'onster'. – Jonno_FTW Sep 04 '11 at 08:56
  • Not really sure what you mean by that. A normal suffix trie wouldn't have such a thing. Are you sure you're doing what you're supposed to be doing? – Brian Roach Sep 04 '11 at 13:56
0

Two solutions: either you build a lighter structure (an array list and a hashset per mode is a lot) or, if it's your best solution, you use -mx and -ms command line options for the jam your programs run in.

Snicolas
  • 37,840
  • 15
  • 114
  • 173