Matching Subset in a String

Question

Let's say I have-

String x = "ab";
String y = "xypa";

If I want to see if any subset of x exists in y, what would be the fastest way? Looping is time consuming. In the example above a subset of x is "a" which is found in y.

score 2 · Answer 1 · answered Oct 14 '10 at 17:14

The answer really depends on many things.

If you just want to find any subset and you're doing this only once, looping is just fine (and the best you can do without using additional storage) and you can stop when you find a single character that matches.

If you have a fixed x and want to use it for matching several strings y, you can do some pre-processing to store the characters in x in a table and use this table to check if each character of y occurs in x or not.

If you want to find the largest subset, then you're looking at a different problem: the longest common subsequence problem.

score 1 · Answer 2 · answered Oct 14 '10 at 17:06

1

Well, I'm not sure it's better than looping, but you could use String#matches:

if (y.matches(".*[" + x + "]+.*")) ...

You'd need to escape characters that are special in a regex [] construct, though (like ], -, \, ...).

The above is just an example, if you're doing it more than once, you'll want to use Pattern, Matcher, and the other stuff from the java.util.regex package.

answered Oct 14 '10 at 17:06

T.J. Crowder

1,031,962
187
1,923
1,875

Sorry, error in the original regex (`matches` looks for a complete match for the entire string). Fixed. – T.J. Crowder Oct 14 '10 at 17:10
@Eugene: It deals with the case he listed. But if `x` is "abc" and he wants to match "ab", indeed, this is not the answer. Might be part of it, though. – T.J. Crowder Oct 14 '10 at 21:46

Pops · Answer 3 · 2010-10-14T19:06:17.377

Looping is time-consuming, but there's no way to do what you want other than going over the target string repeatedly.

What you can do is optimize by checking the smallest strings first, and work your way up. For example, if the target string doesn't contain abc, it can't possibly contain abcdef.

Other optimizations off the top of my head:

Don't continue to check for a match after a non-matching character is hit, though in Java you can let the computer worry about this.
Don't check to see if something is a match if there aren't enough characters left in the target string for a match to be possible.
If you need speed and have lots of space, you might be able to break the target string up into a fancy data structure like a trie for better results, though I don't have an exact algorithm in mind.
Another storage-is-not-a-problem solution: decompose the target into every possible substring and store the results in a HashSet.

score 0 · Answer 4 · answered Oct 14 '10 at 17:11

0

You have to use for loop or use regex which is just as expensive as a for loop, becasue you need to convert one of your strings into chars basically.

Boolean isSubset = false;
for(int i = 0; i < x.length(); i++) {
        if(y.contains(x.charAt(i))) {
                isSubset = true;
                break;
        }
}

using a for loop.

answered Oct 14 '10 at 17:11

Jacob Nelson

2,946
5
33
41

*"...which is just as expensive as a for loop..."* Using a regex A) At least leaves open the possibility the library implementation in a given runtime may be able to do better (e.g., native calls, whatever), and B) Leverages code that's already written. It may well be that it's just as slow (or slower!) than looping yourself. Or not. – T.J. Crowder Oct 14 '10 at 17:13
... All I was saying was if you do 1million for loops like this or 1million regex that the cost for the application to run would be about the same. – Jacob Nelson Oct 14 '10 at 17:20

score 0 · Answer 5 · answered Oct 14 '10 at 17:13

0

It looks like this could be a case of the longest common substring problem.

answered Oct 14 '10 at 17:13

Nick

2,827
4
29
39

score 0 · Answer 6 · answered Oct 14 '10 at 17:16

0

You can generate all subsets of x (e.g. , in your example, ab, a, b) and then generate a regexp that would do the

  Pattern p = Pattern.compile("(ab|a|b)");
  Matcher m = p.matcher(y);
  if(m.find()) {
    System.err.println(m.group());
  }

answered Oct 14 '10 at 17:16

Eugene Kuleshov

31,461
5
66
67

score 0 · Answer 7 · answered Oct 14 '10 at 17:28

If both Strings will only contain [a-z]. Then fastest would be to make two bitmaps, 26 bits longs. Mark all the bits contained in the String. Take the AND of the bitmaps, the resulting bits are present in both Strings, the largest common subset. This would be a simple O(n) with n the length of the biggest String.

(If you want to cover the whole lot of UTF, bloom filters might be more appropriate. )

score 0 · Answer 8 · answered Oct 14 '10 at 20:22

What about this:?

package so3935620;

import static org.junit.Assert.*;

import java.util.BitSet;

import org.junit.Test;

public class Main {

  public static boolean overlap(String s1, String s2) {
    BitSet bs = new BitSet();
    for (int i = 0; i < s1.length(); i++) {
      bs.set(s1.charAt(i));
    }
    for (int i = 0; i < s2.length(); i++) {
      if (bs.get(s2.charAt(i))) {
        return true;
      }
    }
    return false;
  }

  @Test
  public void test() {
    assertFalse(overlap("", ""));
    assertTrue(overlap("a", "a"));
    assertFalse(overlap("abcdefg", "ABCDEFG"));
  }
}

And if that version is too slow, you can compute the BitSet depending on s1, save that in some variable and later only loop over s2.

Matching Subset in a String

8 Answers8

Linked