Given a word and a text, return the count of the occurrences of anagrams of the word in the text

Question

For eg. word is for and the text is forxxorfxdofr, anagrams of for will be ofr, orf, fro, etc. So the answer would be 3 for this particular example.

Here is what I came up with.

#include<iostream>
#include<cstring>

using namespace std;

int countAnagram (char *pattern, char *text)
{
    int patternLength = strlen(pattern);
    int textLength = strlen(text);

    int dp1[256] = {0}, dp2[256] = {0}, i, j;

    for (i = 0; i < patternLength; i++)
    {
        dp1[pattern[i]]++;
        dp2[text[i]]++;
    }

    int found = 0, temp = 0;

    for (i = 0; i < 256; i++)
    {
        if (dp1[i]!=dp2[i])
        {
            temp = 1;
            break;
        }
    }

    if (temp == 0)
        found++;


    for (i = 0; i < textLength - patternLength; i++)
    {
        temp = 0;
        dp2[text[i]]--;
        dp2[text[i+patternLength]]++;
        for (j = 0; j < 256; j++)
        {
            if (dp1[j]!=dp2[j])
            {
                temp = 1;
                break;
            }
        }
        if (temp == 0)
            found++;
    }
    return found;
}


int main()
{
    char pattern[] = "for";
    char text[] = "ofrghofrof";

    cout << countAnagram(pattern, text);

}

Does there exist a faster algorithm for the said problem?

Are overlapping anagrams allowed? If the text is `frof`, is the answer 1 or 2? — aschepler, Oct 03 '13 at 17:12
I might be missing the point, but why aren't you just searching the string for anagrams of "for"? Your logic seems really complicated — bengoesboom, Oct 03 '13 at 17:16
I am not sure what algorithm you are proposing. But it seems your algorithm would have a time complexity O(NM) where N = length of text and M = length of pattern. — user2560730, Oct 03 '13 at 17:20
Yes, there are faster algorithms, but many depend on the length of the word and the length of the text. For small sizes of text, a brute-force approach may be faster than a highly optimized version. — Thomas Matthews, Oct 03 '13 at 19:13

score 0 · Answer 1 · answered Oct 03 '13 at 19:28

Most of the time will be spent searching, so to make the algorithm more time efficient, the objective is to reduce the quantities of searches or optimize the search.

Method 1: A table of search starting positions.

Create a vector of lists, one vector slot for each letter of the alphabet. This can be space-optimized later.

Each slot will contain a list of indices into the text.

Example text: forxxorfxdofr

Slot  List  
'f'    0 --> 7 --> 11  
'o'    1 --> 5 --> 10  
'r'    2 --> 6 --> 12

For each word, look up the letter in the vector to get a list of indexes into the text. For each index in the list, compare the text string position from the list item to the word.

So with the above table and the word "ofr", the first compare occurs at index 1, second compare at index 5 and last compare at index 10.

You could eliminate near-end of text indices where (index + word length > text length).

score 0 · Answer 2 · edited May 23 '17 at 11:57

0

You can use the commutativity of multiplication, along with uniqueness of primal decomposition. This relies on my previous answer here

Create a mapping from each character into a list of prime numbers (as small as possible). For e.g. a-->2, b-->3, c-->5, etc.. This can be kept in a simple array.

Now, convert the given word into the multiplication of the primes matching each of its characters. This results will be equal to a similar multiplication of any anagram of that word.

Now sweep over the array, and at any given step, maintain the multiplication of the primes matching the last L characters (where L is the length of your word). So every time you advance you do

mul = mul * char2prime(text[i]) / char2prime(text[i-L])

Whenever this multiplication equals that of your word - increment the overall counter, and you're done

Note that this method would work well on short words, but the primes multiplication can overflow a 64b var pretty fast (by ~9-10 letters), so you'll have to use a large number math library to support longer words.

edited May 23 '17 at 11:57

Community

1
1

answered Oct 04 '13 at 04:59

Leeor

19,260
5
56
87

Why don't you just use 2 for each letter in the search word, and 3 otherwise? That way you can have more letters in the string (up to `log_3(2^31-1)` letters). And a more efficient one (virtually without limit) would be just use an array to store the count of each letter in the last `L` characters. <-- the answer given in the duplicate link – justhalf Oct 04 '13 at 05:28
@justhalf, that won't ensure unique identification. If the search word is `for`, you'll also hit `ffr` or `rro` etc.. As for using an array - it's possible, although with additional space, as well as comparing the array values on each iteration - this solution should have better complexity I think, and anyway - it's different :) – Leeor Oct 04 '13 at 07:35
Ah, ya I forgot about that. About the array, since the array size is only 26 (or 52), I think that doesn't really matter. =) – justhalf Oct 04 '13 at 07:50

score -1 · Answer 3 · answered Oct 04 '13 at 04:11

This algorithm is reasonably efficient if the pattern to be anagrammed is so short that the best way to search it is to simply scan it. To allow longer patterns, the scans represented here by the 'for jj' and 'for mm' loops could be replaced by more sophisticated search techniques.

// sLine -- string to be searched
// sWord -- pattern to be anagrammed
// (in this pseudo-language, the index of the first character in a string is 0)
// iAnagrams -- count of anagrams found

iLineLim = length(sLine)-1
iWordLim = length(sWord)-1

// we need a 'deleted' marker char that will never appear in the input strings
chNil = chr(0)

iAnagrams = 0 // well we haven't found any yet have we
// examine every posn in sLine where an anagram could possibly start
for ii from 0 to iLineLim-iWordLim do {
  chK = sLine[ii]
  // does the char at this position in sLine also appear in sWord
  for jj from 0 to iWordLim do {
    if sWord[jj]=chK then { 
      // yes -- we have a candidate starting posn in sLine

      // is there an anagram of sWord at this position in sLine
      sCopy = sWord // make a temp copy that we will delete one char at a time
      sCopy[jj] = chNil // delete the char we already found in sLine
      // the rest of the anagram would have to be in the next iWordLim positions
      for kk from ii+1 to ii+iWordLim do {
        chK = sLine[kk]
        cc = false
        for mm from 0 to iWordLim do { // look for anagram char
          if sCopy[mm]=chK then { // found one
            cc = true
            sCopy[mm] = chNil // delete it from copy
            break // out of 'for mm'
          }
        }
        if not cc then break // out of 'for kk' -- no anagram char here
      }
      if cc then { iAnagrams = iAnagrams+1 }

      break // out of 'for jj'
    }
  }
}

-Al.

-1. The question is tagged `c++`, while this is certainly not C++. — Daniel Kamil Kozar, Oct 04 '13 at 07:27

Given a word and a text, return the count of the occurrences of anagrams of the word in the text

3 Answers3

Method 1: A table of search starting positions.