Implementing an efficent algorithm to find the intersection of two strings

Question

Implement an algorithm that takes two strings as input, and returns the intersection of the two, with each letter represented at most once.

Algo: (considering language used will be c#)

Convert both strings into char array
take the smaller array and generate a hash table for it with key as the character and value 0
Now Loop through the other array and increment the count in hash table if that char is present in it.
Now take out all char for hash table whose value is > 0.
These are intersection values.

This is an O(n), solution but is uses extra space, 2 char arrays and a hash table

Can you guys think of better solution than this?

He already suggests an algorithms and asks if anyone knows how to do it better. — sharptooth, Jul 07 '09 at 05:16
hey....see my algo above, I need to know if we can solve this in O(n) time without using extra space — Learner, Jul 07 '09 at 05:17
I don't C#, so I don't know, but wouldn't this be perfect for a Set (such as found in Java or Python)? — nilamo, Jul 07 '09 at 05:17
I don't want to use any built-in set data structure........so I would love to avoid hash tables if we have a solution without it — Learner, Jul 07 '09 at 05:20
You don't need to maintain the character count, so the value in the hash table can be a simple boolean. — fishlips, Jul 07 '09 at 05:22
Just as an aside, where are you doing your cs course. I have liked your questions so far, your instructor is giving you valuable problems, these are the kinds of things you will actually come up against when you are programming for a living. — Tim Jarvis, Jul 07 '09 at 05:27
Hey Tim........ I am learning programming so I have picked up these questions from net....... It cool that u liked my questions — Learner, Jul 07 '09 at 05:35
Just as an aside, can a user continually use this site as a buffer for interview prep? It looks like this user has admitted they are going for an interview for a SDET position at Microsoft, and all their questions are asking in full items from the Microsoft interview questions (http://sellsbrothers.com/fun/msiview/default.aspx?content=question.htm). I know he/she is adding a psudo solution, but they are essentially using SO as a psudocode->c# translation service. Is this faux-pas? Just wondering <_ — glasnt, Jul 07 '09 at 05:39
Hey TomatoSandwich, This is just a contribution to the community so that any1 one else can use this as a reference for his/her prep... And helping others is my motive, also I am making every attempt to come up with initial solution, I don't think asking question is of any harm as it helps me as well as helps others — Learner, Jul 07 '09 at 05:53

score 11 · Answer 1 · answered Jul 07 '09 at 05:28

11

How about this ...

var s1 = "aabbccccddd";
var s2 = "aabc";

var ans = s1.Intersect(s2);

answered Jul 07 '09 at 05:28

JP Alioto

44,864
6
88
112

While that's the easy way, the OP is asking for an algorithm to do it directly. – Ahmad Mageed Jul 07 '09 at 05:30
@Ahmad: if you want the code, just use Reflector to see how Intersect is implemented. :) – JP Alioto Jul 07 '09 at 05:38
@JP: true, this answer and Reflector both crossed my mind initially :) – Ahmad Mageed Jul 07 '09 at 05:40
1

Or just look it up: "When the object returned by this method is enumerated, Intersect enumerates first, collecting all distinct elements of that sequence. It then enumerates second, marking those elements that occur in both sequences. Finally, the marked elements are yielded in the order in which they were collected." - http://msdn.microsoft.com/en-us/library/bb460136.aspx – rmoore Jul 07 '09 at 05:57
The OP states that they'd prefer to do this "without using extra space". The Intersect method creates and populates a Set behind the scenes (ie, it uses extra space). Having said that, this is an improvement on the original algorithm: it uses a Set rather than a Hashtable, it gets rid of the char arrays, and it removes the final "cleanup" loop altogether. – LukeH Jul 07 '09 at 08:54

Phil · Answer 2 · 2009-07-07T06:40:33.043

Haven't tested this, but here's my thought:

Quicksort both strings in place, so you have an ordered sequence of characters
Keeping an index into both strings, compare the "next" character from each string, pick and output the first one, incrementing the index for that string.
Continue until you get to the end of one of the strings, then just pull unique values from the rest of the remaining string.

Won't use additional memory, only needs the two original strings, two integers, and an output string (or StringBuilder). As an added bonus, the output values will be sorted too!

Part 2: This is what I'd write (sorry about the comments, new to stackoverflow):

private static string intersect(string left, string right)
{
  StringBuilder theResult = new StringBuilder();

  string sortedLeft = Program.sort(left);
  string sortedRight = Program.sort(right);

  int leftIndex = 0;
  int rightIndex = 0;

  //  Work though the string with the "first last character".
  if (sortedLeft[sortedLeft.Length - 1] > sortedRight[sortedRight.Length - 1])
  {
    string temp = sortedLeft;
    sortedLeft = sortedRight;
    sortedRight = temp;
  }

  char lastChar = default(char);
  while (leftIndex < sortedLeft.Length)
  {
    char nextChar = (sortedLeft[leftIndex] <= sortedRight[rightIndex]) ? sortedLeft[leftIndex++] : sortedRight[rightIndex++];

    if (lastChar == nextChar) continue;

    theResult.Append(nextChar);
    lastChar = nextChar;
  }

  //  Add the remaining characters from the "right" string
  while (rightIndex < sortedRight.Length)
  {
    char nextChar = sortedRight[rightIndex++];
    if (lastChar == nextChar) continue;

    theResult.Append(nextChar);
    lastChar = nextChar;
  }
  theResult.Append(sortedRight, rightIndex, sortedRight.Length - rightIndex);

  return (theResult.ToString());
}

I hope that makes more sense.

You could do better than quicksort, knowing that the data is characters, at least for extremely large data sets. If you are dealing with really hugestrings, you could do a library sort, and get O(n) performance, by basically creating a 255 char array. — Jeremy Salwen, Jul 07 '09 at 05:29
I didn't got "Keeping an index into both strings, compare the "next" character from each string, pick and output the first one, incrementing the index for that string." Can you shed some more light — Learner, Jul 07 '09 at 05:36
I meant something like this: char lastChar = default(char); while (leftIndex < sortedLeft.Length) { char nextChar = (rightIndex >= sortedRight.Length) || (sortedLeft[leftIndex] <= sortedRight[rightIndex]) ? sortedLeft[leftIndex++] : sortedRight[rightIndex++]; if (lastChar == nextChar) continue; theResult.Append(nextChar); lastChar = nextChar; } Hope that's clear. I've actually written it now, works OK EXCEPT I discovered that an in-place sort isn't possible with a .NET String - bummer. — Phil, Jul 07 '09 at 06:17
unknown: "by basically creating a 255 char array..." AFAIK, C# strings are UTF16, so a counting sort is not that easy to implement. With byte-sized chars, a bool[255] (or even 8 ints) would be a useful solution to the original problem, with minimal overhead. — fishlips, Jul 07 '09 at 06:55

score 1 · Answer 3 · answered Jul 07 '09 at 05:22

You don't need to 2 char arrays. The System.String data type has a built-in indexer by position that returns the char from that position, so you could just loop through from 0 to (String.Length - 1). If you're more interested in speed than optimizing storage space, then you could make a HashSet for the one of the strings, then make a second HashSet which will contain your final result. Then you iterate through the second string, testing each char against the first HashSet, and if it exists then add it the second HashSet. By the end, you already have a single HashSet with all the intersections, and save yourself the pass of running through the Hashtable looking for ones with a non-zero value.

EDIT: I entered this before all the comments on the question about not wanting to use any built-in containers at all

sounds good..but we may still want to iterate through the hashset to get those chars and convert it into string. — Learner, Jul 07 '09 at 05:26

Victor · Answer 4 · 2009-07-07T05:52:07.760

1

here's how I would do this. It's still O(N) and it doesn't use a hash table but instead one int array of length 26. (ideally)

make an array of 26 integers, each element for a letter of the alphebet. init to 0's.
iterate over the first string, decrementing one when a letter is encountered.
iterate over the second string and take the absolute of whatever is at the index corresponding to any letter you encounter. (edit: thanks to scwagner in comments)
return all letters corresponding to all indexes holding value greater than 0.

still O(N) and extra space of only 26 ints.

of course if you're not limited to only lower or uppercase characters your array size may need to change.

edited Jul 07 '09 at 05:52

answered Jul 07 '09 at 05:31

Victor

5,697
6
34
41

continuing from above example, c[a] = -2 after first pass then during pass of s2, we will flip the sign for first occurrence of 'a', and we will not flip the sign for 2nd occurrence of 'a' as it is already +ve........ – Learner Jul 07 '09 at 05:43
2

instead of "flipping the sign" in step 3, using Math.Abs would always keep the number positive, even in even-number-occurring repeating character counts for s2. – scwagner Jul 07 '09 at 05:43
Here instead of using array of ints we can use array of bools, with each index being the char ascii value, so we will require bool array of size 256. – Learner Jul 07 '09 at 08:15
1

It may not be an issue for the OP, but to cater for all possible unicode characters your array would need to be **a lot** bigger, which means that using a HashSet or similar would probably require less space overall. – LukeH Jul 07 '09 at 09:13
an array of bools wouldn't work here because you'd need one value for letters you've never seen, one value for one's you've seen in the first string, and another for the letters in both. so byte would probably be the most space efficient. but you'd have to make sure that when decrementing that you don't overflow. – Victor Jul 07 '09 at 15:15

score 0 · Answer 5 · answered Jul 07 '09 at 05:49

"with each letter represented at most once"

I'm assuming that this means you just need to know the intersections, and not how many times they occurred. If that's so then you can trim down your algorithm by making use of yield. Instead of storing the count and continuing to iterate the second string looking for additional matches, you can yield the intersection right there and continue to the next possible match from the first string.

Implementing an efficent algorithm to find the intersection of two strings

5 Answers5

Linked