10

This is an interview question (phone screen): write a function (in Java) to find all permutations of a given word that appear in a given text. For example, for word abc and text abcxyaxbcayxycab the function should return abc, bca, cab.

I would answer this question as follows:

  • Obviously I can loop over all permutations of the given word and use a standard substring function. However it might be difficult (for me right now) to write code to generate all word permutations.

  • It is easier to loop over all text substrings of the word size, sort each substring and compare it with the "sorted" given word. I can code such a function immediately.

  • I can probably modify some substring search algorithm but I do not remember these algorithms now.

How would you answer this question?

Philipp Reichart
  • 20,771
  • 6
  • 58
  • 65
Michael
  • 41,026
  • 70
  • 193
  • 341

6 Answers6

12

This is probably not the most efficient solution algorithmically, but it is clean from a class design point of view. This solution takes the approach of comparing "sorted" given words.

We can say that a word is a permutation of another if it contains the same letters in the same number. This means that you can convert the word from a String to a Map<Character,Integer>. Such conversion will have complexity O(n) where n is the length of the String, assuming that insertions in your Map implementation cost O(1).

The Map will contain as keys all the characters found in the word and as values the frequencies of the characters.

Example. abbc is converted to [a->1, b->2, c->1]

bacb is converted to [a->1, b->2, c->1]

So if you have to know if two words are one the permutation of the other, you can convert them both into maps and then invoke Map.equals.

Then you have to iterate over the text string and apply the transformation to all the substrings of the same length of the words that you are looking for.

Improvement proposed by Inerdial

This approach can be improved by updating the Map in a "rolling" fashion.

I.e. if you're matching at index i=3 in the example haystack in the OP (the substring xya), the map will be [a->1, x->1, y->1]. When advancing in the haystack, decrement the character count for haystack[i], and increment the count for haystack[i+needle.length()].

(Dropping zeroes to make sure Map.equals() works, or just implementing a custom comparison.)

Improvement proposed by Max

What if we also introduce matchedCharactersCnt variable? At the beginning of the haystack it will be 0. Every time you change your map towards the desired value - you increment the variable. Every time you change it away from the desired value - you decrement the variable. Each iteration you check if the variable is equal to the length of needle. If it is - you've found a match. It would be faster than comparing the full map every time.

Pseudocode provided by Max:

needle = "abbc"
text = "abbcbbabbcaabbca"

needleSize = needle.length()
//Map of needle character counts
targetMap = [a->1, b->2, c->1]

matchedLength = 0
curMap = [a->0, b->0, c->0]
//Initial map initialization
for (int i=0;i<needle.length();i++) {
    if (curMap.contains(haystack[i])) {
        matchedLength++
        curMap[haystack[i]]++
    }
}

if (matchedLength == needleSize) {
    System.out.println("Match found at: 0");
}

//Search itself
for (int i=0;i<haystack.length()-needle.length();i++) {
    int targetValue1 = targetMap[haystack[i]]; //Reading from hashmap, O(1)
    int curValue1 = curMap[haystack[i]]; //Another read
    //If we are removing beneficial character
    if (targetValue1 > 0 && curValue1 > 0 && curValue1 <= targetValue1) {       
        matchedLength--;
    }
    curMap[haystack[i]] = curValue1 + 1; //Write to hashmap, O(1)


    int targetValue2 = targetMap[haystack[i+needle.length()]] //Read
    int curValue2 = curMap[haystack[i+needle.length()]] //Read
    //We are adding a beneficial character
    if (targetValue2 > 0 && curValue2 < targetValue2) { //If we don't need this letter at all, the amount of matched letters decreases
        matchedLength++;
    }
    curMap[haystack[i+needle.length()]] = curValue2 + 1; //Write

    if (matchedLength == needleSize) {
        System.out.println("Match found at: "+(i+1));
    }
}

//Basically with 4 reads and 2 writes which are 
//independent of the size of the needle,
//we get to the maximal possible performance: O(n)
Vitaly Olegovitch
  • 3,509
  • 6
  • 33
  • 49
  • this answer seems incomplete. You say how you are going to canonicalize the word, but say nothing about finding permutations in the text. Would you be using the same idea as the posters 2nd? – Colin D May 23 '12 at 20:58
  • 1
    When combined with the OP's second idea, this approach can be improved by updating the Map in a "rolling" fashion. I.e. if you're matching at index `i=3` in the example haystack in the OP (the substring `xya`), the map will be `[a->1, x->1, y->1]`. When advancing in the haystack, decrement the character count for `haystack[i]`, and increment the count for `haystack[i+needle.length()]`. (Dropping zeroes to make sure `Map.equals()` works, or just implementing a custom comparison.) – millimoose May 23 '12 at 21:02
  • @Inerdial your improvement is really elegant! Congrats!! – Vitaly Olegovitch May 23 '12 at 21:05
  • @ColinD You are ritht. I have updated my answer. If you think it is still incomplete, feel free to edit it. :) – Vitaly Olegovitch May 23 '12 at 21:06
  • 3
    What if we also introduce `matchedCharactersCnt` variable? At the beginning of the haystack it will be 0. Every time you change your map **towards** the desired value - you increment the variable. Every time you change it **away** from the desired value - you decrement the variable. Each iteration you check if the variable == length of needle. If it is - you've found a match. It would be faster than comparing the full map every time. – bezmax May 23 '12 at 21:25
  • That would require 2 additional `map.get()` on each iteration (to see if it's `toward` or `away` from desired value), which is still way faster than comparing 2 maps on each iteration (especially if needle is large). – bezmax May 23 '12 at 21:31
  • `apply the transformation to all the substrings of the same length of the words that you are looking for` - This is hard isn't it? – Chip May 23 '12 at 21:41
  • @Chip it's slow and not optimal, but no, not hard. – bezmax May 23 '12 at 21:47
  • @Max Sorry I should have clarified. I meant slow - hard for the computer. As I understand it, you have n choose m substrings right? Where N is the large string and M is the small string. Thats a lot!! – Chip May 23 '12 at 21:54
  • You don't need a map. You can use a StringBuilder as I showed in my answer (my notYetFounds var). I think this is more performant then building a map of frequency. – Andrea Parodi May 23 '12 at 23:00
  • @Max what happends when you are looking for permutations of `aabb` and you get a permutation of `abbb`? I think it is better to use a `Map` of matched characters. If it is empty, you have found a permutation. – Vitaly Olegovitch May 24 '12 at 07:33
  • @VitalijZadneprovskij it will still work ok in that case. Check out the pseudocode on the algorithm I proposed: http://pastebin.com/H4GLY0zb . Basically for `abbb` you will get `length=3` because of substring `[abb]b`, last `b` won't increase the length, because it's 3-rd one, and 3 > 2 (we are not moving towards our target, but away from it). – bezmax May 24 '12 at 11:16
5

To find a permutation of a string you can use number theory. But you will have to know the 'theory' behind this algorithm in advance before you can answer the question using this algorithm.

There is a method where you can calculate a hash of a string using prime numbers. Every permutation of the same string will give the same hash value. All other string combination which is not a permutation will give some other hash value.

The hash-value is calculated by c1 * p1 + c2 * p2 + ... + cn * pn where ci is a unique value for the current char in the string and where pi is a unique prime number value for the ci char.

Here is the implementation.

public class Main {
    static int[] primes = new int[] { 2, 3, 5, 7, 11, 13, 17, 
        19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 
        73, 79, 83, 89, 97, 101, 103 };

    public static void main(String[] args) {        
        final char[] text = "abcxaaabbbccyaxbcayaaaxycab"
            .toCharArray();     
        char[] abc = new char[]{'a','b','c'};       
        int match = val(abc);                   
        for (int i = 0; i < text.length - 2; i++) {
            char[] _123 = new char[]{text[i],text[i+1],text[i+2]};          
            if(val(_123)==match){
                System.out.println(new String(_123) );      
            }
        }
    }   
    static int p(char c) {
        return primes[(int)c - (int)'a'];
    }   
    static int val(char[] cs) {
        return 
        p(cs[0])*(int)cs[0] + p(cs[1])*(int)cs[1] + p(cs[2])*(int)cs[2];        
    }
}

The output of this is: abc bca cab

Kunukn
  • 2,136
  • 16
  • 16
  • This is not correct. The correct way to calculate such a hash would be to use p(cs[0]) * p(cs[1]) * p(cs[2]). It's ok for small strings, for big ones, however, you will have to use BigInteger for the hash, so it has a limited applicability. – vladich Oct 14 '18 at 03:43
3

You should be able to do this in a single pass. Start by building a map that contains all the characters in the word you're searching for. So initially the map contains [a, b, c].

Now, go through the text one character at a time. The loop looks something like this, in pseudo-code.

found_string = "";
for each character in text
    if character is in map
        remove character from map
        append character to found_string
        if map is empty
            output found_string
            found_string = ""
            add all characters back to map
        end if
    else
        // not a permutation of the string you're searching for
        refresh map with characters from found_string
        found_string = ""
    end if
end for

If you want unique occurrences, change the output step so that it adds the found strings to a map. That'll eliminate duplicates.

There's the issue of words that contain duplicated letters. If that's a problem, make the key the letter and the value a count. 'Removing' a character means decrementing its count in the map. If the count goes to 0, then the character is in effect removed from the map.

The algorithm as written won't find overlapping occurrences. That is, given the text abcba, it will only find abc. If you want to handle overlapping occurrences, you can modify the algorithm so that when it finds a match, it decrements the index by one minus the length of the found string.

That was a fun puzzle. Thanks.

Jim Mischel
  • 131,090
  • 20
  • 188
  • 351
  • Agree: a fun puzzle. I just note that the code in my answer is affected by a the duplicated letters problem. I'll edit my code inspired from your answer. – Andrea Parodi May 23 '12 at 22:20
1

This is what I would do - set up a flag array with one element equal to 0 or 1 to indicate whether that character in STR had been matched

Set the first result string RESULT to empty.

for each character C in TEXT:

Set an array X equal to the length of STR to all zeroes.

for each character S in STR: If C is the JTH character in STR, and X[J] == 0, then set X[J] <= 1 and add C to RESULT. If the length of RESULT is equal to STR, add RESULT to a list of permutations and set the elements of X[] to zeroes again.

If C is not any character J in STR having X[J]==0, then set the elements of X[] to zeroes again.

A B
  • 4,068
  • 1
  • 20
  • 23
1

The second approach seems very elegant to me and should be perfectly acceptable. I think it scales at O(M * N log N), where N is word length and M is text length.

I can come up with a somewhat more complex O(M) algorithm:

  1. Count the occurrence of each character in the word
  2. Do the same for the first N (i.e. length(word)) characters of the text
  3. Subtract the two frequency vectors, yielding subFreq
  4. Count the number of non-zeroes in subFreq, yielding numDiff
  5. If numDiff equals zero, there is a match
  6. Update subFreq and numDiff in constant time by updating for the first and after-last character in the text
  7. Go to 5 until reaching the end of the text

EDIT: See that several similar answers have been posted. Most of this algorithm is equivalent to the rolling frequency counting suggested by others. My humble addition is also updating the number of differences in a rolling fashion, yielding an O(M+N) algorithm rather than an O(M*N) one.

EDIT2: Just saw that Max has basically suggested this in the comments, so brownie points to him.

smocking
  • 3,689
  • 18
  • 22
  • O(M+N) is a huge improvement over O(M*N) :) – Chip May 23 '12 at 22:01
  • I'm not sure why is your algorithm `O(M+N)`, are you assuming that reading and writing into Map is `O(N)`? I believe your algorithm is `O(M)`. For a proper Map implementation (suiting this task) reading/writing would be `O(1)`. – bezmax May 23 '12 at 22:12
  • @Max its actually O(M) since N < M. I'm sure you thought N was the text size :) I did too. – Chip May 23 '12 at 22:14
  • @Chip yes, sorry, I thought N is length of haystack. Fixed my comment. – bezmax May 23 '12 at 22:15
  • Steps 1-4 are `O(N)` and steps 5-7 are `O(M)`, yielding `O(M+N)` for the entire algorithm. (darn mixed them up myself too) – smocking May 23 '12 at 22:15
  • @smocking yes but since O(M+N) can be max of O(2M), we can simplify this to O(M) – Chip May 23 '12 at 22:16
  • @smocking Ah, ok then. Here's somewhat-pseudocode for that algorithm: http://pastebin.com/H4GLY0zb – bezmax May 23 '12 at 22:17
  • ...but yeah maybe it's `O(M)` if you know that `N – smocking May 23 '12 at 22:19
  • @smocking When `N > M`, the algorithm can be simplified to `O(1)` :D – bezmax May 23 '12 at 22:20
1

This code should do the work:

import java.util.ArrayList;
import java.util.List;

public class Permutations {
    public static void main(String[] args) {
        final String word = "abc";
        final String text = "abcxaaabbbccyaxbcayxycab";
        List<Character> charsActuallyFound = new ArrayList<Character>();
        StringBuilder match = new StringBuilder(3);

        for (Character c : text.toCharArray()) {
            if (word.contains(c.toString()) && !charsActuallyFound.contains(c)) {
                charsActuallyFound.add(c);
                match.append(c);
                if (match.length()==word.length())
                {
                    System.out.println(match);
                    match = new StringBuilder(3);
                    charsActuallyFound.clear();
                }
            } else {
                match = new StringBuilder(3);
                charsActuallyFound.clear();
            }
        }
    }
}

The charsActuallyFound List is used to keep track of character already found in the loop. It is needed to avoid mathing "aaa" "bbb" "ccc" (added by me to the text you specified).

After further reflection, I think my code only work if the given word has no duplicate characters. The code above correctly print

abc
bca
cab

but if you seaarch for the word "aaa", then nothing is printed, because each char can not be matched more than one time. Inspired from Jim Mischel answer, I edit my code, ending with this:

import java.util.ArrayList;
import java.util.List;

public class Permutations {
    public static void main(String[] args) {
        final String text = "abcxaaabbbccyaxbcayaaaxycab";

        printMatches("aaa", text);
        printMatches("abc", text);
    }

    private static void printMatches(String word, String text) {
        System.out.println("matches for "+word +" in "+text+":");

        StringBuilder match = new StringBuilder(3);
        StringBuilder notYetFounds=new StringBuilder(word);

        for (Character c : text.toCharArray()) {
            int idx = notYetFounds.indexOf(c.toString());
            if (idx!=-1) {
               notYetFounds.replace(idx,idx+1,"");

                match.append(c);
                if (match.length()==word.length())
                {
                    System.out.println(match);
                    match = new StringBuilder(3);
                    notYetFounds=new StringBuilder(word);
                }
            } else {
                match = new StringBuilder(3);
                notYetFounds=new StringBuilder(word);
            }
        }
        System.out.println();
    }

}

This give me following output:

matches for aaa in abcxaaabbbccyaxbcayaaaxycab:
aaa
aaa

matches for abc in abcxaaabbbccyaxbcayaaaxycab:
abc
bca
cab

Did some benchmark, the code above found 30815 matches of "abc" in a random string of 36M in just 4,5 seconds. As Jim already said, thanks for this puzzle...

Andrea Parodi
  • 5,534
  • 27
  • 46
  • 1. Map would yield MUCH more performance. This part in your code `notYetFounds.indexOf(c.toString());` makes whole algorithm to be a whooping `O(needle.length()*haystack.length())` complexity, as `indexOf` complexity in your case is basically `O(needle.length())`. However reading/writing into a map is `O(1)` complexity. Therefore, algorithms that use Map result in `O(haystack.length())` complexity which is a magnitude faster (especially for huge needles). – bezmax May 24 '12 at 05:00
  • 2. Your algorithm can't handle overlapping needles, and I don't see a way to modify it so it would handle them. For example: `haystack="abcba"` and `needle="abc"` would result in two matches: `[abc]ba` and `ab[cba]`, while your algorithm produces only first match (and then resets the buffers). – bezmax May 24 '12 at 05:04
  • Yes, I saw. if needle increase in length the code will start start to perform slowly. I'll try to code another version with maps. Oh and no, overlapping words are not matched by my code... – Andrea Parodi May 24 '12 at 06:41
  • Max: I did some other test. The performance of my code is not linear on needle.length(). While with a needle of 3 chars it took 453 ms, with one of 405 chars it took 2515 ms. Using freq maps is faster than using StringBuilder when needle.legth > ~ 140. I'll edit my answer to include the source code of the map version. – Andrea Parodi May 24 '12 at 07:35
  • Mmh, I was totally wrong: the code using frequency map is ALWAYS fast than that using string builder. Both version performance are slowed by the fact that i was rebuilding the builder (or the map) for every character, even if they are untouched! Changes this make both version faster, and the frequency map code is now performing better for eahc leght of searched word. – Andrea Parodi May 24 '12 at 08:37