
I am trying to find all occurrences of substrings (of all lengths) in a main string. My function takes one string and returns a dictionary of every substring that occurs more than once, mapped to how many times it occurs (format of the dictionary: {substring: # of occurrences, ...}). I am using collections.Counter(s) to help me with it.

Here is my function:

from collections import Counter

def patternFind(s):
    patterns = {}
    for index in range(1, len(s)+1)[::-1]:
        d = nChunks(s, step=index)
        parts = dict(Counter(d))
        patterns.update({elem: parts[elem] for elem in parts.keys() if parts[elem] > 1})
    return patterns

def nChunks(iterable, start=0, step=1):
    return [iterable[i:i+step] for i in range(start, len(iterable), step)]

I have a string, data, with about 2500 random letters (in a random order). However, there are 2 copies of a string inserted into it (at random points). Say this string is 'TEST'. data.count('TEST') returns 2, but patternFind(data)['TEST'] gives me a KeyError; therefore, my program does not detect the two strings in it.
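
The same failure shows up with a short hand-made string (a minimal check reusing patternFind from above; the string here is just an illustration):

s = 'XTESTABTEST'
print(s.count('TEST'))             # 2
print(patternFind(s).get('TEST'))  # None, so ['TEST'] raises KeyError; expected 2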

What have I done wrong? Thanks!

Edit: My method of creating testing-instances:

from random import randint, choice
from string import uppercase

def createNewTest():
    n = randint(500, 2500)
    x, y = randint(500, n), randint(500, n)
    s = ''
    for i in range(n):
        s += choice(uppercase)
        if i == x or i == y: s += "TEST"
    return s
Rushy Panchal
  • Can you show how you are inserting "TEST" into your string? Your code seems to work perfectly for me. I created a test string with `s = [random.choice(string.letters+string.digits) for i in range(2500)]` then inserted "TEST" at a couple of places using `s[ind:ind+4] = "TEST"` and joined the list into a single string with `s = ''.join(s)`. Calling `out = patternFind(s)` gave me a dict where `out['TEST'] == 2`. – Vorticity Apr 06 '13 at 01:16
  • @Vorticity Edited my code with how I am creating the string. I tried it with 100 strings, and only 29 were correct. Sometimes it works, and sometimes it doesn't. – Rushy Panchal Apr 06 '13 at 01:28
  • I guess my first reply was wrong. The code doesn't work as is. I'm out of time on this for now, but at least I can give you a hint. Start using a MUCH shorter string (say `n = randint(20, 40)`) and print `parts` every time through the loop. You will find that you are not looking at every possibility. Your find routine is about to get a lot more complicated and I think you'd be better off trying to go back to the original problem and figure out what it is that you are TRULY looking for. – Vorticity Apr 06 '13 at 01:51
  • Or, it looks like mezzomondo may have a good answer for you. – Vorticity Apr 06 '13 at 01:52
  • @Vorticity Yeah it's because of the starting point of `nChunks`. I created a way to find ALL substrings, but for a 2500-character file, it had 3.1M or so substrings (some may have been duplicates), which inevitably slowed my program down. – Rushy Panchal Apr 06 '13 at 01:55
  • Perhaps I am missing something. Are you only trying to generate a dictionary with all the substrings as keys? Or only those which occur more than once? Why are you not simply constructing this dictionary in a forward fashion? In terms of the algorithmics, it doesn't make sense to use finds or regexs. – Bitwise Apr 06 '13 at 02:45
  • Is your `createNewTest()` the actual code you use? Because it can generate a string as small as 500 characters. Also, using `+=` to generate string longer than a few pieces is *very* slow. – Ethan Furman Apr 07 '13 at 21:24
  • @EthanFurman I was just using that to make test files in large amounts (100 or so). But yes, that was my actual code. – Rushy Panchal Apr 07 '13 at 21:55
  • +1 for provoking this "contest" :) – daedalus Apr 11 '13 at 17:22

4 Answers


Using Regular Expressions

Apart from the count() method you described, a regex is an obvious alternative:

import re

needle = r'TEST'

haystack = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklagh'
pattern = re.compile(needle)

print len(re.findall(pattern, haystack))
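
Note that findall counts non-overlapping matches. If overlapping occurrences ever matter, a zero-width lookahead can count them instead (not needed for 'TEST', which cannot overlap itself):

print(len(re.findall(r'(?=TEST)', haystack)))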

Short Cut

If you need to build a dictionary of substrings, possibly you can do this with only a subset of those strings. Assuming you know the needle you are looking for in the data, you only need the dictionary of substrings of data that have the same length as the needle. This is very fast.

from collections import Counter

needle = "TEST"

def gen_sub(s, len_chunk):
    for start in range(0, len(s)-len_chunk+1):
        yield s[start:start+len_chunk]

data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
parts = Counter(gen_sub(data, len(needle)))

print parts[needle]

Brute Force: building a dictionary of all substrings

If you need to have a count of all possible substrings, this works but it is very slow:

from collections import Counter

def gen_sub(s):
    for start in range(0, len(s)):
        for end in range(start+1, len(s)+1):
            yield s[start:end]

data = 'khjkzahklahjTESTkahklaghTESTjklajhkhz'
parts = Counter(gen_sub(data))

print parts['TEST']
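
For a sense of scale: a string of length n has n(n+1)/2 non-empty substrings, so n = 2500 already means 2500 * 2501 / 2 = 3,126,250 candidates before any counting starts.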

Substring generator adapted from this: https://stackoverflow.com/a/8305463/1290420

daedalus
  • It takes very long for the 2500-letter string, though. – Rushy Panchal Apr 03 '13 at 14:34
  • I want to find the patterns of every substring in the main string (only given the main string). I don't necessarily need to do it brute-force, but that's how I thought of doing it originally. If there is another way, please tell me about it. – Rushy Panchal Apr 03 '13 at 15:16
  • Added two variants that are faster than the brute force method. Hope this helps. – daedalus Apr 03 '13 at 15:52
  • The thing is, I don't know the "needle". I tried doing `d = {}`, then `for i in range(1, len(data)+1): parts.update(dict(Counter([sub for sub in gen_sub(data, i)])))` but this takes very long and actually froze my computer. – Rushy Panchal Apr 03 '13 at 17:22
  • Finding all possible substrings, of all possible lengths, in a string of 2500 characters **is** going to take an inordinate length of time. You may need to go back to the analysis of the problem and choose a different way of solving it. I will leave my answer here to show others what _doesn't_ work for you. – daedalus Apr 03 '13 at 18:11

The problem is in your nChunks function: it does not give you all of the necessary chunks.

Let's consider a test string:

s = '1test2345test'

For the chunks of size 4 your nChunks function gives this output:

>>> nChunks(s, step=4)
['1tes', 't234', '5tes', 't']

But what you really want is:

>>> def nChunks(iterable, start=0, step=1):
...     return [iterable[i:i+step] for i in range(start, len(iterable)-step+1)]
>>> nChunks(s, step=4)
['1tes', 'test', 'est2', 'st23', 't234', '2345', '345t', '45te', '5tes', 'test']

You can see that this way there are two 'test' chunks and your patternFind(s) will work like a charm:

>>> patternFind(s)
{'tes': 2, 'st': 2, 'te': 2, 'e': 2, 't': 4, 'es': 2, 'est': 2, 'test': 2, 's': 2}
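
For completeness, here is the whole thing put together (a sketch: patternFind is exactly as in the question, only nChunks is replaced):

from collections import Counter

def nChunks(iterable, start=0, step=1):
    # overlapping sliding window: every chunk of length `step`
    return [iterable[i:i+step] for i in range(start, len(iterable)-step+1)]

def patternFind(s):
    patterns = {}
    for index in range(1, len(s)+1)[::-1]:
        d = nChunks(s, step=index)
        parts = dict(Counter(d))
        patterns.update({elem: parts[elem] for elem in parts.keys() if parts[elem] > 1})
    return patterns

print(patternFind('1test2345test')['test'])  # prints 2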
jurgenreza
  • Elegant, but no different in essence than my short cut function. The problem is only partly the generation of substrings; the real issue is how long it takes with a string of 2500 characters. – daedalus Apr 06 '13 at 04:11
  • @gauden I hadn't read the comments closely. I focused on what was asked in the question; there, you ask "What have I done wrong?" and there is no mention of speed. Maybe you should create another question addressing the size and speed of the operation. – jurgenreza Apr 06 '13 at 04:18
  • @gauden sorry I thought you are the OP. :) – jurgenreza Apr 06 '13 at 04:20
  • @jurgenreza This is the fastest method (that I've found)... for the 2500-character file, took 19 seconds, which is significantly better than the rest... Thanks! – Rushy Panchal Apr 06 '13 at 14:54

While jurgenreza has explained why your program didn't work, the solution is still quite slow. If you only examine substrings s for which you know that s[:-1] repeats, you get a much faster solution (typically a hundred times faster, or more):

from collections import defaultdict

def pfind(prefix, sequences):
    # group the remaining sequences by their next element
    collector = defaultdict(list)
    for sequence in sequences:
        collector[sequence[0]].append(sequence)
    for item, matching_sequences in collector.items():
        if len(matching_sequences) >= 2:
            # at least two sequences continue with `item`, so prefix + item repeats
            new_prefix = prefix + item
            yield (new_prefix, len(matching_sequences))
            for r in pfind(new_prefix, [sequence[1:] for sequence in matching_sequences]):
                yield r

def find_repeated_substrings(s):
    s0 = s + " "  # sentinel (assumes " " not in s): no suffix is a prefix of another, so the recursion never meets an empty sequence
    return pfind("", [s0[i:] for i in range(len(s))])

If you want a dict, you call it like this:

result = dict(find_repeated_substrings(s))

On my machine, for a run with 2247 elements, it took 0.02 sec, while the original (corrected) solution took 12.72 sec.

(Note that this is a rather naive implementation; using indexes instead of substrings should be even faster.)

Edit: The following variant works with other sequence types (not only strings). Also, it doesn't need a sentinel element.

from collections import defaultdict

def pfind(s, length, ends):
    # each index in `ends` marks the element that would extend a repeat of length `length`
    collector = defaultdict(list)
    if ends[-1] >= len(s):
        del ends[-1]
    for end in ends:
        if end < len(s):
            collector[s[end]].append(end)
    for key, matching_ends in collector.items():
        if len(matching_ends) >= 2:
            end = matching_ends[0]
            yield (s[end - length: end + 1], len(matching_ends))
            for r in pfind(s, length + 1, [end + 1 for end in matching_ends if end < len(s)]):
                yield r


def find_repeated_substrings(s):
    return pfind(s, 0, list(range(len(s))))
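
For example, a usage sketch on a list: the yielded keys are then lists, which are unhashable, so convert them to tuples before building a dict:

data = list('1test2345test')
result = {tuple(sub): count for sub, count in find_repeated_substrings(data)}
print(result[tuple('test')])  # 2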

This still has the problem that very long repeated substrings will exceed the recursion limit. You might want to catch the exception.
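
A minimal sketch of such a guard (the exception is RuntimeError on Python 2 and its subclass RecursionError on Python 3; the retry limit here is a guess, justified by the depth being bounded by the longest repeat):

import sys

try:
    result = dict(find_repeated_substrings(s))
except RuntimeError:  # maximum recursion depth exceeded
    # depth needed is at most the length of the longest repeated substring;
    # alternatively, fall back to one of the slower iterative answers
    sys.setrecursionlimit(len(s) + 1000)
    result = dict(find_repeated_substrings(s))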

Reinstate Monica
  • This is much faster, thanks! Can you explain how each step works, though? – Rushy Panchal Apr 11 '13 at 16:03
  • @WolframH - This is by far the fastest solution and I am awarding the bounty here. The recursive solution completely re-defines the problem and is most elegant. One downside: if the sub-string occurs very frequently in the main string, then the function will exceed recursion depth and crash; the exception should be caught and either fail gracefully or revert to one of the slower solutions. You may want to note this in the answer? Congrats again! – daedalus Apr 11 '13 at 17:20
  • I'm still not sure how it works (I haven't worked with generators often); I tried changing it to work with lists (converting it to a tuple) but I couldn't get it to work with that. Could you add a subsection with how it would work for lists (or any non-string iterable object)? Thanks :3 – Rushy Panchal Apr 11 '13 at 18:02

Here you can find a solution that uses a recursive wrapper around string.find() that searches all the occurrences of a substring in a main string. The collectallchuncks() function returns a defaultdict with all the substrings as keys; for each substring, the value is a list of all the indexes where the substring is found in the main string.

import collections

# Minimum substring size, may be 1
MINSIZE = 3

# Recursive wrapper
def recfind(p, data, pos, acc):
    res = data.find(p, pos)
    if res == -1:
        return acc
    else:
        acc.append(res)
        return recfind(p, data, res+1, acc)

def collectallchuncks(data):
    res = collections.defaultdict(str)
    size = len(data)
    for base in xrange(size):
        for seg in xrange(MINSIZE, size-base+1):
            chunk = data[base:base+seg]
            if data.count(chunk) > 1:
                res[chunk] = recfind(chunk, data, 0, [])
    return res

if __name__ == "__main__":
    data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
    allchuncks = collectallchuncks(data)
    print 'TEST', allchuncks['TEST']
    print 'hklag', allchuncks['hklag']

EDIT: If you just need the number of occurrences of each substring in the main string, you can easily obtain it by getting rid of the recursive function:

import collections

MINSIZE = 3

def collectallchuncks2(data):
    res = collections.defaultdict(str)
    size = len(data)
    for base in xrange(size):
        for seg in xrange(MINSIZE, size-base+1):
            chunk = data[base:base+seg]
            cnt = data.count(chunk)
            if cnt > 1:
                res[chunk] = cnt
    return res

if __name__ == "__main__":
    data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
    allchuncks = collectallchuncks2(data)
    print 'TEST', allchuncks['TEST']
    print 'hklag', allchuncks['hklag']
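
As a rough cost estimate: the double loop generates on the order of n^2 chunks, and data.count() rescans all n characters for each one, so the total work grows roughly as n^3; this is consistent with the long runtimes reported in the comments.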
mezzomondo
  • This is similar to what gauden had except in one function. For a file with about 2500 words, it took 46 seconds, which is pretty long. – Rushy Panchal Apr 06 '13 at 01:52
  • I'm a little skeptical that you are going to find a faster method using Python. This might be a point where you want to drop to C for speed or re-think what it is that you need. – Vorticity Apr 06 '13 at 02:37
  • @Vorticity: see my answer; it can be much faster even in python. – Reinstate Monica Apr 11 '13 at 12:01
  • That's a fantastic answer. Thanks for reminding me to come back to this question and congrats on the bounty. – Vorticity Apr 15 '13 at 18:57