1

I have a specified String and I need to check this String for identical parts of specified length. For example if the String is "abcdab" and the length is specified as 2, the identical parts in this string are "ab" (always looking for the most repeated one). I reworked my algorithm 4-5 times for the best performance, but in the end, if the length of the String is 1m+, it throws an Java Heap space error.

So my question is: how to solve the error, maybe there is another way of checking the identical parts, or maybe some other way how to construct the whole algorithm. I figured out 1 possible solution of this, but it works very slowly, so I'm asking only for solutions as fast as my current algorithm or perhaps, even faster ones. Here is the current code:

int length = 2;
String str = "ababkjdklfhcjacajca";
ArrayList<String> h = new ArrayList<String>(); 
h.add(str.substring(0, length));
ArrayList<Integer> contains = new ArrayList<Integer>();
contains.add(1);

String c;
for (int g = 1; g < str.length()-length+1; g++) {
    c = str.substring(g, length+g);
    for (int e = 0; e < h.size(); e++) {
        if (h.get(e).charAt(0) == c.charAt(0) && h.get(e).charAt(length-1) == c.charAt(length-1)) {
            if (h.get(e).equals(c)) {
                contains.set(e, contains.get(e)+1);
                break;
            }
        }
        else if (e+1 == h.size()) {
            h.add(c);
            contains.add(1);
            break;
        }
    }

}

ArrayList h stores every unique part of the String, ArrayList contains represents amount of repeats of that every unique part of the string. String c is the main problem (java heap space points here). It gradually represents each part of the string before it gets stored in the ArrayList h (if that c is unique). After that I'll just find the most repeated ones by using the ArrayLists and print them.

trincot
  • 317,000
  • 35
  • 244
  • 286
  • 1
    A sidenote while I'm reading your code: it would be much easier to understand if you name your variables with meaningful names – GalAbra Jan 22 '18 at 16:41

4 Answers4

1

If you want to do the search efficiently, time and memory wise, I suggest you the following:

  1. First, create a simple character histogram containing the number of occurrences of each character. If a substring’s first character has less occurrences than the most common substring we’ve found so far, we can skip this substring.

  2. Instead of creating substrings which contain copies of the character contents, we use a CharBuffer which wraps the string and adjust its position and limit to represent a subsequence. Of course, we must not modify a buffer once it has been stored as key in our map, so we create a new buffer for each key when it has been stored in the map. So we create at most one CharBuffer for each distinct substring and these buffers still only wrap the String rather then copy any character data

public static Map<String,Integer> mostCommonSubstring(String s, int len) {
    int[] charHistogram = new int[Character.MAX_VALUE+1];
    s.chars().forEach(ch -> charHistogram[ch]++);
    int most = 0;
    HashMap<Buffer, Integer> subStrings = new HashMap<>();
    CharBuffer cb = CharBuffer.wrap(s);
    for(int ix = 0, e = s.length()-len; ix <= e; ix++) {
        if(charHistogram[s.charAt(ix)] < most) continue;
        int num = subStrings.merge(cb.limit(ix+len).position(ix), 1, Integer::sum);
        if(num == 1) cb = CharBuffer.wrap(s);
        if(num > most) most = num;
    }
    final int mostOccurences = most;
    return subStrings.entrySet().stream()
        .filter(e -> e.getValue() == mostOccurences)
        .collect(Collectors.toMap(e -> e.getKey().toString(), Map.Entry::getValue));
}

The first two lines create our histogram

    int[] charHistogram = new int[Character.MAX_VALUE+1];
    s.chars().forEach(ch -> charHistogram[ch]++);

Within the loop

        if(charHistogram[s.charAt(ix)] < most) continue;

checks whether the first character of the current substring has less occurrences than the most common string we’ve found so far and skip the subsequent test in that case.

The next line adapts the current buffer to represent the substring, and updates the map, associating the buffer with 1 if not present or add 1 to the count of the existing mapping.

        int num = subStrings.merge(cb.limit(ix+len).position(ix), 1, Integer::sum);

We use the returned value to detect whether the merge operation has created a new association in the map, which is only the case if the result is one. In that case, we must not modify the buffer afterwards, hence, create a new one

        if(num == 1) cb = CharBuffer.wrap(s);

Then, we use the result to keep track of the highest number of occurrences

        if(num > most) most = num;

The final step after the loop is simple. We already have the highest number of occurrences, filter the map to keep entries with a matching number (there might be a tie) and create a new map, now converting the buffers to String instances as we don’t want to keep references to the original String and it affects only the few result substrings anyway.

    final int mostOccurences = most; // needed because most is not “effectively final”
    return subStrings.entrySet().stream()
        .filter(e -> e.getValue() == mostOccurences)
        .collect(Collectors.toMap(e -> e.getKey().toString(), Map.Entry::getValue));
Holger
  • 285,553
  • 42
  • 434
  • 765
  • This solution works perfectly, until the last part (with the "len" variable 100.092 still throws an heap space error here), but I can change it a bit and hopefully evade it. Also I tried to delete the charHistogram and it worked at circa the same speed. So can I ask you what is the reason for such a speed? I suggest that its probably because of the charBuffer, and different way how to store the unique substrings in HashMap. – Alexander Taran Jan 24 '18 at 18:14
  • Yes, the biggest impact is the use of the char buffer, as it eliminates the copying overhead, which saves both, memory and time. The hash map also makes it faster than your original approach, but other posted approaches use hash maps too. The last statement creates a map which can contain an arbitrary number of substrings in case of a tie (in the worst case, all substrings, each with an occurrence number of one), so you may just insert a `limit(…)` with a small number between the `filter` and `collect` step. You may use `.limit(1)` if you are happy with just one string… – Holger Jan 25 '18 at 08:09
  • The effectiveness of the histogram heavily depends on the contents of the input string. But even for strings where it doesn’t help, it won’t hurt… – Holger Jan 25 '18 at 08:11
0

you could try using a Map to track the occurrences of substrings, for example:

public class Test {
    public static void main(String[] args) {
        String test = "asdasdagagsjug8afhnqh3gbq29873brfuysbf78sdgy0yg7483wthsddbfahbfasfga78dftg78VGFIBDVGIUASF8928HWEAWD";
        int substringLength = 2;

        Map<String, Integer> tracker = new HashMap<>();

        for(int i = 0; i < test.length() - substringLength + 1; i ++) {
            String subString = test.substring(i, substringLength + i);
            tracker.compute(subString, (k,v) -> v == null ? 1 : v + 1);
        }

        for(String key : tracker.keySet()) {
            System.out.println("Substring: " + key + " has " + tracker.get(key) + " occurences.");
        }
    }
}

In your example you are tracking every occurrence separately. For example, in a String:

ababababababababababababa

you would store each substring separately in your List h. The above code would track the String ab in 1 mapping instead. In fact, it'd print:

Substring: ab has 12 occurences.
Substring: ba has 12 occurences.

I hope that gives you a place to start with.

pandaadb
  • 6,306
  • 2
  • 22
  • 41
  • It still doesnt work for a 1m+ long string, and it seems like this mapping thing works on the same princip as my algorithm, sorry if Im wrong, but it looks similar for me. The problem here is the subString variable, which has the same purpose as in my algorithm, just to represent the current part of the main String in the loop. – Alexander Taran Jan 22 '18 at 17:20
  • What are your heap settings. I just ran this against 159.434.752 characters and it worked fine. – pandaadb Jan 22 '18 at 17:33
  • -Xms4096m -Xmx4096m Im running it in Eclipse. – Alexander Taran Jan 22 '18 at 17:37
  • I am running with eclipse with default configurations. I think you will have to provide an example to reproduce your issue – pandaadb Jan 23 '18 at 10:04
  • Your above code does not cause an issue for me. It runs extremely slow. As for your question: you are right, you are somewhat doing the same mapping in a manual way. Your issue is that you have to search for an existing substring in each iteration. (Your inner for loop). A map has constant lookup time, so you don't search by using a map. – pandaadb Jan 23 '18 at 10:18
  • 1
    `tracker.compute(subString, (k,v) -> v == null ? 1 : v + 1)` can be simplified to `tracker.merge(subString, 1, v -> v + 1)` or `tracker.merge(subString, 1, Integer::sum)` – Holger Jan 23 '18 at 14:53
  • I probably found the core of the problem. The length of the substrings in case of the 1.000.000 long string is specified as 100.092. I tried your solution on a substring length of 2 and it worked. So maybe you could try it on substring length 100k + and tell your findings? – Alexander Taran Jan 24 '18 at 16:35
0

You could iterate through the string one time by adding the location g to g+1 to a hash table, and then use an if to check if the occurence is in the hashtable. For example, abcab would be ab ->2, bc->1, ca->1 in the table.

int length = 2;
String str = "ababkjdklfhcjacajca";
Hashtable<String, Integer> identicalStrings = new Hashtable<String, Integer>();
h.add(str.substring(0, length));

for (int i = 0; i < str.length() - 1; i++) {
   if(!identicalStrings.contains(str.substring(i, i+2)) {
        identicalStrings.numbers.put(str.substring(i, i+2), 1);
   } else {
       identicalStrings.put(str.substring(i, i+2), identicalStrings.get(str.substring(i, i+2)) + 1);
   }

}

I wrote this out real quick so I am not sure if it compiles, but something similar to this should work.

Chris
  • 153
  • 2
  • 15
0

It's a fun study case for using a Map (to represent the number of occurrences for each substring), Pattern and Matcher classes.

Another thing that was also interesting for me was that a substring aa - for example - appears 2 times in aaa; and not 1 time as I initially counted, using replaceAll method (for counting single characters).


My solution
(I've tested it with a 10^8=100,000,000 characters long String and it worked well. The only boundary that seems to exist is the length of the String input)

public static Map<String, Integer> getMostRepeatedSubstring(int length, String str) {
    HashSet<String> possibleSubstrings = new HashSet<>();
    Map<String, Integer> ans = new HashMap<>();
    int max = 0;

    // Create a list of all the unique substrings of "str" with a certain length
    for(int i=0; i<str.length()-length; i++) {
        String curr = str.substring(i, i+length);
        possibleSubstrings.add(curr);
        // "curr" is added only if it doesn't already appear within "possibleSubstrings"
    }

    for(String sub : possibleSubstrings) {
        Pattern pattern = Pattern.compile(sub, Pattern.LITERAL);
        Matcher matcher = pattern.matcher(str);
        int currentOccurrences = 0;
        while (matcher.find())
            currentOccurrences ++;

        if(currentOccurrences > max) {           // We have a new winner!
            max = currentOccurrences;
            ans.clear();
            ans.put(sub, currentOccurrences);
        }
        else if (currentOccurrences == max) {    // We have a tide
            ans.put(sub, currentOccurrences);
        }
    }

    return ans;
}

Edit: Thank you @Holger for the important improvements!

GalAbra
  • 5,048
  • 4
  • 23
  • 42
  • How did you run it? Like Im running it in eclipse and always getting an java heap space error. Maybe I have to create exe or setup something? I already set max heap space to 4096... – Alexander Taran Jan 22 '18 at 21:10
  • @AlexanderTaran I'm using IntelliJ IDEA. Are you sure your string length doesn't exceed 2,147,483,647 characters? – GalAbra Jan 22 '18 at 21:17
  • I probably found out what is the problem, I unfortunately forgot to mention the specified length of the substrings that it has to look for. Its 100.092. And the length of the main string is exactly 1.699.541 chars. I tried your solution on substrings of the length of 2 and it worked well. But the specified length 100.092 is the problem... – Alexander Taran Jan 24 '18 at 16:43