18

We need to combine 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts and the parts should not be duplicated. For example,

  "a" + "b" + "c" => "abc"
  "abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
  "abcdede" + "dedefgh" + "" => "abcdedefgh"
  "abcde" + "d" + "ghlmn" => "abcdedghlmn"
  "abcdef" + "" + "defghl" => "abcdefghl"

Our current algorithm is pretty slow because it uses brute-force to identify the overlapping part between 2 strings. Does any one know an efficient algorithm to do this?

Say we have 2 strings A and B. The algorithm needs to find the longest common substring S so that A ends with S and B starts with S.

Our current brute-force implementation in Java is attached for reference,

public static String concat(String s1, String s2) {
    if (s1 == null)
        return s2;
    if (s2 == null)
        return s1;
    int len = Math.min(s1.length(), s2.length());

    // Find the index for the end of overlapping part
    int index = -1;
    for (int i = len; i > 0; i--) {
        String substring = s2.substring(0, i);
        if (s1.endsWith(substring)) {
            index = i;
            break;
        }
    }
    StringBuilder sb = new StringBuilder(s1);
    if (index < 0) 
        sb.append(s2);
    else if (index <= s2.length())
        sb.append(s2.substring(index));
    return sb.toString();
}
Alan Moore
  • 73,866
  • 12
  • 100
  • 156
ZZ Coder
  • 74,484
  • 29
  • 137
  • 169
  • What's the average length of the strings? – RBarryYoung Aug 17 '09 at 01:57
  • Only mid string is long, average about 100 characters. The prefix and postfix are only about 20 characters. It doesn't really matter. The database conversion only happens once. It will take about several days but only few hours will be spent on the concatenation of strings in the worst case. I am trying to find the optimal solution, just for fun and the challenge. I think I found it. Thanks! – ZZ Coder Aug 17 '09 at 11:51

12 Answers12

29

Most of the other answers have focused on constant-factor optimizations, but it's also possible to do asymptotically better. Look at your algorithm: it's O(N^2). This seems like a problem that can be solved much faster than that!

Consider Knuth Morris Pratt. It keeps track of the maximum amount of substring we have matched so far throughout. That means it knows how much of S1 has been matched at the end of S2, and that's the value we're looking for! Just modify the algorithm to continue instead of returning when it matches the substring early on, and have it return the amount matched instead of 0 at the end.

That gives you an O(n) algorithm. Nice!

    int OverlappedStringLength(string s1, string s2) {
        //Trim s1 so it isn't longer than s2
        if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length);

        int[] T = ComputeBackTrackTable(s2); //O(n)

        int m = 0;
        int i = 0;
        while (m + i < s1.Length) {
            if (s2[i] == s1[m + i]) {
                i += 1;
                //<-- removed the return case here, because |s1| <= |s2|
            } else {
                m += i - T[i];
                if (i > 0) i = T[i];
            }
        }

        return i; //<-- changed the return here to return characters matched
    }

    int[] ComputeBackTrackTable(string s) {
        var T = new int[s.Length];
        int cnd = 0;
        T[0] = -1;
        T[1] = 0;
        int pos = 2;
        while (pos < s.Length) {
            if (s[pos - 1] == s[cnd]) {
                T[pos] = cnd + 1;
                pos += 1;
                cnd += 1;
            } else if (cnd > 0) {
                cnd = T[cnd];
            } else {
                T[pos] = 0;
                pos += 1;
            }
        }

        return T;
    }

OverlappedStringLength("abcdef", "defghl") returns 3

Craig Gidney
  • 17,763
  • 5
  • 68
  • 136
  • 2
    Running this against my implementation over a million randomly generated strings (from alphabet a-z, length 8-23, using the same collection of strings for both implementations) yours takes twice as long. Whether this implementation is actually better will depend significantly on the nature of the overlap in the actual strings being concatenated, so as always... ZZ Coder will need to measure against his dataset before deciding what *actually* works best. – jerryjvl Aug 17 '09 at 01:09
  • 1
    Even so, +1 for use of Knuth ;) – jerryjvl Aug 17 '09 at 01:11
  • Yeah, looks like a lot of constant time in this implementation. If the strings are the size of those in the OP, might not be worth it. Gotta profile. – FogleBird Aug 17 '09 at 01:14
  • I agree that your search will be faster for small random strings. In fact, if the input is random, you can achieve an expected-linear-time algorithm by progressively filtering the possible starting positions by checking more and more characters. But relying on the input for anything is a bad idea. What if you get long strings, repetitive strings, strings with only 0s and 1s, etc? You're algorithm's performance relies on a reasonably random input, which may or may not apply in practice (or might stop applying in the future). eg. English character frequencies hurt your performance slightly. – Craig Gidney Aug 17 '09 at 01:56
  • I wasn't intending my measurement or rebuttal to be generic though; keep in mind that we know very little about the actual data ZZ Coder will be running the algorithm against; without further knowledge like that we cannot say anything about which implementation is best. As I said... he'll need to measure with his actual real strings to be able to make an informed decision. – jerryjvl Aug 17 '09 at 03:58
  • 1
    My goal was to find an O(n) solution and I think this is it. Thanks! – ZZ Coder Aug 17 '09 at 11:41
4

You may use a DFA. For example, a string XYZ should be read by the regular expression ^((A)?B)?C. That regular expression will match the longest prefix which matches a suffix of the XYZ string. With such a regular expression you can either match and get the match result, or generate a DFA, on which you can use the state to indicate the proper position for the "cut".

In Scala, the first implementation -- using regex directly -- might go like this:

def toRegex(s1: String) = "^" + s1.map(_.toString).reduceLeft((a, b) => "("+a+")?"+b) r
def concatWithoutMatch(s1 : String, s2: String) = {
  val regex = toRegex(s1)
  val prefix = regex findFirstIn s2 getOrElse ""
  s1 + s2.drop(prefix length)
}

For example:

scala> concatWithoutMatch("abXabXabXac", "XabXacd")
res9: java.lang.String = abXabXabXacd

scala> concatWithoutMatch("abc", "def")
res10: java.lang.String = abcdef

scala> concatWithoutMatch(concatWithoutMatch("abcde", "defgh"), "ghlmn")
res11: java.lang.String = abcdefghlmn
Daniel C. Sobral
  • 295,120
  • 86
  • 501
  • 681
  • Nice :) ... still, it misses out on the (potentially significant) optimisation in my implementation that discards most typical mismatches based on the first and last character of the overlap under consideration. – jerryjvl Aug 16 '09 at 23:16
  • Not really. Consider that we are matching against the second string. For each character of that second string, the match can either be done or not. Once the first non-matching character is found, you know what the maximum possible match is. This *never* has to compare a character more than once. Now, the overhead of creating and compiling the regex might be excessive. – Daniel C. Sobral Aug 16 '09 at 23:30
  • I may be missing something, but although it never compares a character in 's1' more than once, it may have to check the same character in 's2' twice, surely? ... for example in your first example, where 's1' contains 'XabXabXac'? ... it will have to check the first three characters of 's2' at least twice. – jerryjvl Aug 16 '09 at 23:40
  • 2
    Conversion of NFA to DFA can result in exponential increase in states. You've just moved the complexity from brute-force scanning to building DFA states. – Barry Kelly Aug 16 '09 at 23:59
  • Yes, it can result in exponential increase in states, if loops and alternatives are used. The number of states for this regular expression is equal to the size of the first string, plus the initial and the non-accepting terminal state. – Daniel C. Sobral Aug 17 '09 at 11:27
  • @jerryjvl: no, I won't. That's why regex is so popular. – Daniel C. Sobral Aug 17 '09 at 11:29
3

Here's a solution in Python. It should be faster just by not needing to build substrings in memory all the time. The work is done in the _concat function, which concatenates two strings. The concat function is a helper that concatenates any number of strings.

def concat(*args):
    result = ''
    for arg in args:
        result = _concat(result, arg)
    return result

def _concat(a, b):
    la = len(a)
    lb = len(b)
    for i in range(la):
        j = i
        k = 0
        while j < la and k < lb and a[j] == b[k]:
            j += 1
            k += 1
        if j == la:
            n = k
            break
    else:
        n = 0
    return a + b[n:]

if __name__ == '__main__':
    assert concat('a', 'b', 'c') == 'abc'
    assert concat('abcde', 'defgh', 'ghlmn') == 'abcdefghlmn'
    assert concat('abcdede', 'dedefgh', '') == 'abcdedefgh'
    assert concat('abcde', 'd', 'ghlmn') == 'abcdedghlmn'
    assert concat('abcdef', '', 'defghl') == 'abcdefghl'
FogleBird
  • 74,300
  • 25
  • 125
  • 131
  • You mean this: def _concat(a, b): la = len(a) for i in range(la): if a[i:] == b[:la-i]: return a + b[la-i:] return a + b def _concat(a, b): la = len(a) for i in range(la): if a[i:] == b[:la-i]: return a + b[la-i:] return a + b – hughdbrown Aug 17 '09 at 00:01
  • Or run this: print "\ndef _concat(a, b):\n\tla = len(a)\n\tfor i in range(la):\n\t\tif a[i:] == b[:la-i]:\n\t\t\treturn a + b[la-i:]\n\treturn a + b\n".replace(r'\n', '\n').replace(r'\t', '\t') – hughdbrown Aug 17 '09 at 00:15
  • Yeah, but that was pretty good, right? You run it and it formats itself. – hughdbrown Aug 17 '09 at 02:08
  • Anyway, your version creates substrings which might take more time. But since a lot of that happens down in C-land, it might still run faster, I haven't checked. But the same might not happen when ported to C#. – FogleBird Aug 17 '09 at 02:41
  • I did it for the clarity, not the speed. Body of _concat: 5 lines versus 14. – hughdbrown Aug 17 '09 at 04:23
2

How about (pardon the C#):

public static string OverlapConcat(string s1, string s2)
{
    // Handle nulls... never return a null
    if (string.IsNullOrEmpty(s1))
    {
        if (string.IsNullOrEmpty(s2))
            return string.Empty;
        else
            return s2;
    }
    if (string.IsNullOrEmpty(s2))
        return s1;

    // Checks above guarantee both strings have at least one character
    int len1 = s1.Length - 1;
    char last1 = s1[len1];
    char first2 = s2[0];

    // Find the first potential match, bounded by the length of s1
    int indexOfLast2 = s2.LastIndexOf(last1, Math.Min(len1, s2.Length - 1));
    while (indexOfLast2 != -1)
    {
        if (s1[len1 - indexOfLast2] == first2)
        {
            // After the quick check, do a full check
            int ix = indexOfLast2;
            while ((ix != -1) && (s1[len1 - indexOfLast2 + ix] == s2[ix]))
                ix--;
            if (ix == -1)
                return s1 + s2.Substring(indexOfLast2 + 1);
        }

        // Search for the next possible match
        indexOfLast2 = s2.LastIndexOf(last1, indexOfLast2 - 1);
    }

    // No match found, so concatenate the full strings
    return s1 + s2;
}

This implementation does not make any string copies (partial or otherwise) until it has established what needs copying, which should help performance a lot.

Also, the match check first tests the extremeties of the potentially matched area (2 single characters) which in normal english text should give a good chance of avoiding checking any other characters for mismatches.

Only once it establishes the longest match it can make, or that no match is possible at all, will two strings be concatenated. I have used simple '+' here, because I think the optimisation of the rest of the algorithm has already removed most of the inefficiencies in your original. Give this a try and let me know if it is good enough for your purposes.

jerryjvl
  • 19,723
  • 7
  • 40
  • 55
  • Also note that due to the 'LastIndexOf' it does not even consider all possible overlap offsets, only the ones that could potentially match, which means it could do significantly less than O(n) iterations of the `while` loop. – jerryjvl Aug 16 '09 at 23:26
  • I think it's always dangerous to use built-in functions (like LastIndexOf) and assume the implementation is super-efficient. It would be worth re-writing this so it works in any language (given that the question is language agnostic) which would mean no use of String.LastIndexOf in the implementation. That said, the basic premise of the algorithm looks sound. – Phil Aug 16 '09 at 23:33
  • I wanted the answer to stay relatively brief ;) ... implementing a 'LastIndexOf' that does an efficient character scan within the assumptions that can be made in this implementation should not be very complicated (it does not need to do any bounds checks on the second parameter). – jerryjvl Aug 16 '09 at 23:42
  • Also note that if even more performance needs to be squeezed out the character scan done on `s2` through repeated use of `LastIndexOf` may be sub-optimal if `s2` is significantly longer than `s1`, in which case you could make it special case based on the relative lengths of `s1` and `s2`, where the opposite version obviously would need to use `IndexOf` instead, etc. ... this goes a little beyond the scope of the question though. – jerryjvl Aug 16 '09 at 23:45
  • Thinking a bit more... the separate 'single-character check' to quickly eliminate mismatches can be folded into the comparison loop by iterating 'ix' in the opposite direction. – jerryjvl Aug 17 '09 at 00:11
1

Or you could do it in mysql with the following stored function:

DELIMITER //

DROP FUNCTION IF EXISTS concat_with_overlap //

CREATE FUNCTION concat_with_overlap(a VARCHAR(100), b VARCHAR(100))
  RETURNS VARCHAR(200) DETERMINISTIC
BEGIN 
  DECLARE i INT;
  DECLARE al INT;
  DECLARE bl INT;
  SET al = LENGTH(a);
  SET bl = LENGTH(a);
  IF al=0 THEN 
    RETURN b;
  END IF;
  IF bl=0 THEN 
    RETURN a;
  END IF;
  IF al < bl THEN
     SET i = al;
  ELSE
     SET i = bl;
  END IF;

  search: WHILE i > 0 DO
     IF RIGHT(a,i) = LEFT(b,i) THEN
    RETURN CONCAT(a, SUBSTR(b,i+1));
     END IF;
     SET i = i - 1;
  END WHILE search;

  RETURN CONCAT(a,b);
END//

I tried it with your test data:

mysql> select a,b,c,
    -> concat_with_overlap( concat_with_overlap( a, b ), c ) as result 
    -> from testing //
+-------------+---------+--------+-------------+
| a           | b       | c      | result      |
+-------------+---------+--------+-------------+
| a           | b       | c      | abc         |
| abcde       | defgh   | ghlmn  | abcdefghlmn |
| abcdede     | dedefgh |        | abcdedefgh  |
| abcde       | d       | ghlmn  | abcdedghlmn |
| abcdef      |         | defghl | abcdefghl   |
| abXabXabXac | XabXac  |        | abXabXabXac |
+-------------+---------+--------+-------------+
6 rows in set (0.00 sec)
bjelli
  • 9,752
  • 4
  • 35
  • 50
  • This is a very interesting one. I would definitely use it if we use the same database but we are moving from Sybase to MySQL. – ZZ Coder Aug 17 '09 at 11:44
  • so, do it in MySQL after you arrive ;-) – bjelli Aug 17 '09 at 14:52
  • Our new schema only has the combined column. I have to create some temporary columns to do this. It takes us long time to remove columns (days) in MySQL because the database is fairly big so we try to avoid it. – ZZ Coder Aug 17 '09 at 22:29
1

I think this will be pretty quick:

You have two strings, string1 and string2. Look backwards (right to left) through string1 for the first character of string2. Once you have that position, determine if there is overlap. If there isn't, you need to keep searching. If there is you need to determine if there is any possibility for another match.

To do that, simply explore the shorter of the two strings for a recurrence of the overlapping characters. ie: If the location of the match in string1 leaves a short string1 remaining, repeat the initial search from the new starting point in string1. Conversely, if the unmatched portion of string2 is shorter, search it for a repeat of the overlapping characters.

Repeat as required.

Job done!

This doesn't require much in terms of memory allocation (all searching done in place, just need to allocate the resultant string buffer) and only requires (at most) one pass of one of the strings being overlapped.

Phil
  • 174
  • 7
  • It is better to start with the longest possible overlap and work backwards from there... see my solution ;) – jerryjvl Aug 16 '09 at 23:28
  • I disagree. Your algorithm is slightly shorter, perhaps, but the implementation of String.LastIndexOf is probably implementing my algorithm. – Phil Aug 16 '09 at 23:30
  • My use of `LastIndexOf` scans for a single character... I don't see how that corresponds to your algorithm at all... – jerryjvl Aug 16 '09 at 23:32
  • There's some homework for you, then. Implement both algorithms, run them against the same set of a few million randomly generated strings and measure the relative performance of each. – Phil Aug 16 '09 at 23:41
  • I'll be keen to see the results if you do. – Phil Aug 16 '09 at 23:42
  • Okay... in trying to implement your algorithm: once the initial overlap is found, matching the recurring part will require looking in *both* string1 and string2 for the overlapped part... only if it occurs in both have you found a longer match. (Also, your solution does not use a short-cut to quickly eliminate likely mismatches) ... I'd still happily perform a benchmark, but you will have to be a little more explicit about how to efficiently handle the 'find a longer match' portion of your answer first. – jerryjvl Aug 17 '09 at 00:05
  • Not quite right. If string2 is not self-similar then you don't need to keep searching either string. If it is self-similar then you have the length of the self-similarity, and therefore can do one very direct comparison for a match - no searching required at all. There's probably an optimisation that can be made around single character overlaps (don't look for self-similarities in that instance), etc. You seem cluey, you'll work it out. PS: There are also some global optimisations to apply in the context of the originally stated problem. eg: passing an output stream. String op+ is bad. – Phil Aug 17 '09 at 00:16
  • Why look from Right to Left? Look from Left to Right, and the first match you find will be the one with the largest overlap. – Kirk Broadhurst Aug 17 '09 at 00:58
  • Okay, it appears there is a further problem... when comparing "hilarious gag" and "gag reflex" for instance, this method will find "g" as the first match, try to look further and decide that "a" does not match, so no point looking further, resulting in "hilarious gagag reflex", which is not right... I think you'll have to go with 'longest-match-first' like Kirk suggests to avoid this problem. – jerryjvl Aug 17 '09 at 01:02
  • We go right to left on the assumption that the overlap is small and that self-similarity is unlikely to be present in many string2's – Phil Aug 17 '09 at 09:41
  • Actually (working r-l), "g" will match, "a" won't match, "g" will match, then "gag" will match. No self-similarity in string2 means we're done. – Phil Aug 17 '09 at 09:44
  • OK, have done the experiment. There's nothing in it, and depends upon the data set. If string2 is self-similar jerryjvl's code is faster, if string2 is not, my code is faster - but it's only +/- 20% in either case. I haven't gone all the way with extensive testing, just far enough to get to the point where it isn't worth continuing any further. Kirk's code is at least 2x slower than either jerrjvl's or mine and has some boundary problems that can introduce an infinite loop, so debug that before you use it. – Phil Aug 17 '09 at 13:49
1

I'm trying to make this C# as pleasant to read as possible.

    public static string Concatenate(string s1, string s2)
    {
        if (string.IsNullOrEmpty(s1)) return s2;
        if (string.IsNullOrEmpty(s2)) return s1;
        if (s1.Contains(s2)) return s1;
        if (s2.Contains(s1)) return s2;

        char endChar = s1.ToCharArray().Last();
        char startChar = s2.ToCharArray().First();

        int s1FirstIndexOfStartChar = s1.IndexOf(startChar);
        int overlapLength = s1.Length - s1FirstIndexOfStartChar;

        while (overlapLength >= 0 && s1FirstIndexOfStartChar >=0)
        {
            if (CheckOverlap(s1, s2, overlapLength))
            {
                return s1 + s2.Substring(overlapLength);
            }

            s1FirstIndexOfStartChar = 
                s1.IndexOf(startChar, s1FirstIndexOfStartChar);
            overlapLength = s1.Length - s1FirstIndexOfStartChar;

        }

        return s1 + s2;
    }

    private static bool CheckOverlap(string s1, string s2, int overlapLength)
    {
        if (overlapLength <= 0)
            return false;

        if (s1.Substring(s1.Length - overlapLength) == 
            s2.Substring(0, overlapLength))
            return true;

        return false;            
    }

EDIT: I see that this is almost the same as jerryjvl's solution. The only difference is that this will work with the "abcde", "d" case.

Kirk Broadhurst
  • 27,836
  • 16
  • 104
  • 169
  • This still makes string copies for the `CheckOverlap`, ... I would at the least re-implement this helper method with an in-place comparison loop. – jerryjvl Aug 17 '09 at 00:09
  • I am not sure I understand your "the only difference is ..." ... my algorithm gives the correct result for that example from the original problem... – jerryjvl Aug 17 '09 at 01:18
0

Why not just do something like this. First get the first character or word (which is going to signify the overlap) in the three columns.

Then, start to add the first string to a stringbuffer, one character at a time.

Each time look to see if you reached a part that is overlapped with the second or third string.

If so then start concatenating the string that also contains what is in the first string.

When done start, if no overlap, start with the second string and then the third string.

So in the second example in the question I will keep d and g in two variables.

Then, as I add the first string abc come from the first string, then I see that d is also in the second string so I shift to adding from the second string def are added from string 2, then I move on and finish with string 3.

If you are doing this in a database why not just use a stored procedure to do this?

James Black
  • 41,583
  • 10
  • 86
  • 166
  • Our DBA has decided that this operation is too complicated to do in stored procedure. Besides, we are moving from Sybase to MySQL. So this will be done by a batch job, which can be written in any language. – ZZ Coder Aug 16 '09 at 22:46
0

If you're doing it outside the database, try perl:

sub concat {
  my($x,$y) = @_;

  return $x if $y eq '';
  return $y if $x eq '';

  my($i) = length($x) < length($y) ?  length($x) : length($y);
  while($i > 0) {
      if( substr($x,-$i) eq substr($y,0,$i) )  {
          return $x . substr($y,$i);
      }
      $i--;
  }
  return $x . $y;
}

It's exactly the same algorithms as yours, I'm just curios if java or perl is faster ;-)

bjelli
  • 9,752
  • 4
  • 35
  • 50
0

This problem seems like a variation of the longest common sub-sequence problem, which can be solved via dynamic programming.

http://www.algorithmist.com/index.php/Longest_Common_Subsequence

0

Heres a perl -pseudo oneliner:

$_ = s1.s2;

s/([\S]+)\1/\1/;

perl regex's are pretty efficient, you can look up what algo they are using but they definitely implement some type of FSM etc so will get u results in pretty good O(..).

0

Here is a Java implementation which finds the maximum overlap between two strings with length N and M in something like O(min(N,M)) operations ~ O(N).

I had the same idea as @sepp2k:s now deleted answer, and worked a bit further on it. Seems to work fine. The idea is to iterate through the first string and start tracking once you find something which matches the start of the second string. Figured out that you might need to do multiple simultaneous trackings if false and true matches are overlapping. At the end you choose the longest track.

I havent worked out the absolutely worst case yet, with maximal overlap between matches, but I don't expect it to spiral out of control since I think that you cannot overlap arbitrary many matches. Normally you only track one or two matches at a time: the candidates are removed as soon as there is a mismatch.

static class Candidate {
    int matchLen = 0;
}

private String overlapOnce(@NotNull final String a, @NotNull final String b) {
    final int maxOverlap = Math.min(a.length(), b.length());
    final Collection<Candidate> candidates = new LinkedList<>();
    for (int i = a.length() - maxOverlap; i < a.length(); ++i) {
        if (a.charAt(i) == b.charAt(0)) {
            candidates.add(new Candidate());
        }
        for (final Iterator<Candidate> it = candidates.iterator(); it.hasNext(); ) {
            final Candidate candidate = it.next();
            if (a.charAt(i) == b.charAt(candidate.matchLen)) {
                //advance
                ++candidate.matchLen;
            } else {
                //not matching anymore, remove
                it.remove();
            }
        }

    }
    final int matchLen = candidates.isEmpty() ? 0 :
            candidates.stream().map(c -> c.matchLen).max(Comparator.comparingInt(l -> l)).get();
    return a + b.substring(matchLen);
}

private String overlapOnce(@NotNull final String... strings) {
    return Arrays.stream(strings).reduce("", this::overlapOnce);
}

And some tests:

@Test
public void testOverlapOnce() throws Exception {
    assertEquals("", overlapOnce("", ""));
    assertEquals("ab", overlapOnce("a", "b"));
    assertEquals("abc", overlapOnce("ab", "bc"));
    assertEquals("abcdefghqabcdefghi", overlapOnce("abcdefgh", "efghqabcdefghi"));
    assertEquals("aaaaaabaaaaaa", overlapOnce("aaaaaab", "baaaaaa"));
    assertEquals("ccc", overlapOnce("ccc", "ccc"));
    assertEquals("abcabc", overlapOnce("abcabc", "abcabc"));

    /**
     *  "a" + "b" + "c" => "abc"
     "abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
     "abcdede" + "dedefgh" + "" => "abcdedefgh"
     "abcde" + "d" + "ghlmn" => "abcdedghlmn"
     "abcdef" + "" + "defghl" => "abcdefghl"
     */
    assertEquals("abc", overlapOnce("a", "b", "c"));
    assertEquals("abcdefghlmn", overlapOnce("abcde", "defgh", "ghlmn"));
    assertEquals("abcdedefgh", overlapOnce("abcdede", "dedefgh"));
    assertEquals("abcdedghlmn", overlapOnce("abcde", "d", "ghlmn"));
    assertEquals("abcdefghl", overlapOnce("abcdef", "", "defghl"));


    // Consider str1=abXabXabXac and str2=XabXac. Your approach will output abXabXabXacXabXac because by
    // resetting j=0, it goes to far back.
    assertEquals("abXabXabXac", overlapOnce("abXabXabXac", "XabXac"));

    // Try to trick algo with an earlier false match overlapping with the real match
    //  - match first "aba" and miss that the last "a" is the start of the
    // real match
    assertEquals("ababa--", overlapOnce("ababa", "aba--"));
}
Alexander Torstling
  • 18,552
  • 7
  • 62
  • 74