What's the best way to find the intersection between two strings?

Question

I need to find the intersection between two strings. Assertions:

assert intersect("test", "tes") == list("tes"), "Assertion 1"
assert intersect("test", "ta") == list("t"), "Assertion 2"
assert intersect("foo", "fo") == list("fo"), "Assertion 3"
assert intersect("foobar", "foo") == list("foo"), "Assertion 4"

I tried different implementations for the intersect function. intersect would receive 2 str parameters, w and w2

List comprehension. Iterate and look for occurrences in the second string.

return [l for l in w if l in w2]

Fail assertion 1 and 2 because multiple t in w match the one t in w2

Sets intersections.

return list(set(w).intersection(w2)
return list(set(w) & set(w2))

Fails assertion 3 and 4 because a set is a collection of unique elements and duplicated letters will be discarded.

Iterate and count.

out = ""
for c in s1:
    if c in s2 and not c in out:
        out += c
return out

Fails because it also eliminates duplicates.

difflib (Python Documentation)

letters_diff = difflib.ndiff(word, non_wildcards_letters)
letters_intersection = []

for l in letters_diff:
    letter_code, letter = l[:2], l[2:]
    if letter_code == "  ":
        letters_intersection.append(letter)

return letters_intersection

Passes

difflib works but can anybody think of a better, optimized approach?

EDIT: The function would return a list of chars. The order doesn't really matter.

What's the expected output of `intersection('aba', 'aca')`? `['a']` or `['a', 'a']`? What about `intersection('ab', 'b')`? `[]` or `['b']`? — Aran-Fey, Aug 08 '18 at 14:26
Are these just prefixes? If not, does the order actually matter or just the counts? — Reut Sharabani, Aug 08 '18 at 14:30
Possible duplicate of [Find common substring between two strings](https://stackoverflow.com/questions/18715688/find-common-substring-between-two-strings) — Thierry Lathuille, Aug 08 '18 at 14:32
@Aran-Fey `intersection('aba', 'aca')` should return `['a', 'a']` and `intersection('ab', 'b')` just `['b']` yes. @ReutSharabani , edited the question. No the order doesn't matter. — maxrodrigo, Aug 08 '18 at 14:56
Just to make sure, but `intersection('ab', 'ba')` would be either `['a']` or `['b']`, correct? — Aran-Fey, Aug 08 '18 at 15:16
@Aran-Fey `intersection('ab', 'ba')` would be either `['a', 'b']` or `['b', 'a']` — maxrodrigo, Aug 08 '18 at 15:20
ok, so if I get it, also both `intersection('abac', 'abhac')` and `intersection('acab', 'abhac')` will give `['ab', 'ac']`? the output is actually a set if the order doesn't matter, but then the output `['a', 'a']` is not possible... A counter `{'a':2}`? — xdze2, Aug 08 '18 at 15:22
Oh, so basically each string is treated as a multiset. That's easy, then. — Aran-Fey, Aug 08 '18 at 15:22
Dupe: [Python list intersection with non unique items](//stackoverflow.com/q/12253361) — Aran-Fey, Aug 08 '18 at 15:25
@Aran-Fey: but for every possible n-gram, not only individual char, no? — xdze2, Aug 08 '18 at 15:32
@xdze2 I don't think so. If that were the case, the output of `intersect("foo", "fo")` would be `['fo']` or `['f', 'o', 'fo']` rather than `['f', 'o']`, no? — Aran-Fey, Aug 08 '18 at 15:54
@Aran-Fey: yes, so the output could be for example the sorted concatenation of the intersection of the counter of all individual chars. It is coherent with the proposed tests at the top of the question, but not with the following discussion in the comments... I think some clarification is needed — xdze2, Aug 08 '18 at 16:05
Please could you accept my answer if it works for you? (so that people in the future who encounter your problem will know which solution works for you) — Adi219, Aug 13 '18 at 11:46

score 3 · Accepted Answer · edited Aug 08 '18 at 15:40

3

Try this:

def intersect(string1, string2): 
    common = []
    for char in set(string1):
        common.extend(char * min(string1.count(char), string2.count(char)))

    return common

Note: It doesn't preserve the order (if I remember set() correctly, the letters will be returned in alphabetical order). But, as you say in your comments, order doesn't matter

edited Aug 08 '18 at 15:40

Aran-Fey

39,665
11
104
149

answered Aug 08 '18 at 15:05

Adi219

4,712
2
20
43

@Aran-Fey Edited :) – Adi219 Aug 08 '18 at 15:27
`list(set(list(str1)))` can be just `list(set(str1))`. You have a typo in the `else` branch (`string` instead of `string2`). I'm pretty sure there's no reason to have the whole `if ... else` thing in the first place. And finally, repeatedly calling `str.count` on both strings needlessly increases your run time complexity. You can use `collections.Counter` to count all the characters just once. (And your code had syntax errors before the edit.) – Aran-Fey Aug 08 '18 at 15:30
`min(string1.count(char), str2.count(char))` <- that's a bug. You should be using `str1` there. – Aran-Fey Aug 08 '18 at 15:33
@Aran-Fey The if-else is needed to get the string of minimum length. I wouldn't need it if I knew for sure that the strings were of different length, but as they *could* be, I'm using the if-else. – Adi219 Aug 08 '18 at 15:33
@Aran-Fey No that's correct. This is because the set() removes duplicates but I need to count duplicate characters when counting. – Adi219 Aug 08 '18 at 15:34
@Aran-Fey I've never used collections.Counter so I hope you'll understand if I'm not comfortable using it here :) – Adi219 Aug 08 '18 at 15:36
No, that's a bug. Try `intersect('abb', 'b')`. – Aran-Fey Aug 08 '18 at 15:37
@Aran-Fey I'm on my phone right now so I can't try it. However, I know it should return ['b']. What *does* it return? – Adi219 Aug 08 '18 at 15:38
It returns `['b', 'b']`. I'll fix it for you if you don't mind. (If you do mind, roll back my edit.) – Aran-Fey Aug 08 '18 at 15:40
@Aran-Fey Nah, that's fine! Thanks :) – Adi219 Aug 08 '18 at 15:41

score 1 · Answer 2 · edited Aug 08 '18 at 18:34

1

This works pretty well for your test cases:

def intersect(haystack, needle):
    while needle:
        pos = haystack.find(needle)
        if pos >= 0:
            return list(needle)
        needle = needle[:-1]
    return []

But, bear in mind that, all your test cases are longer then shorter, do not have an empty search term, an empty search space, or a non-match.

edited Aug 08 '18 at 18:34

Adi219

4,712
2
20
43

answered Aug 08 '18 at 14:44

Jerub

41,746
15
73
90

1

Unfortunately the test cases don't come anywhere close to covering all the requirements. If you read the comments below the question you should see that this doesn't do what the OP wants. – Aran-Fey Aug 08 '18 at 15:21

score 0 · Answer 3 · answered Aug 08 '18 at 15:53

Gives the number of co-occurrence for all n-grams in the two strings:

from collections import Counter

def all_ngrams(text):
    ngrams = ( text[i:i+n] for n in range(1, len(text)+1)
                           for i in range(len(text)-n+1) )
    return Counter(ngrams)

def intersection(string1, string2):
    count_1 = all_ngrams(string1)
    count_2 = all_ngrams(string2)
    return count_1 & count_2   # intersection:  min(c[x], d[x])

intersection('foo', 'f') # Counter({'f': 1})
intersection('foo', 'o') # Counter({'o': 1})
intersection('foobar', 'foo') # Counter({'f': 1, 'fo': 1, 'foo': 1, 'o': 2, 'oo': 1})
intersection('abhab', 'abab') # Counter({'a': 2, 'ab': 2, 'b': 2})
intersection('achab', 'abac') # Counter({'a': 2, 'ab': 1, 'ac': 1, 'b': 1, 'c': 1})
intersection('test', 'ates') # Counter({'e': 1, 'es': 1, 's': 1, 't': 1, 'te': 1, 'tes': 1})

What's the best way to find the intersection between two strings?

3 Answers3