16

so given "needle" and "there is a needle in this but not thisneedle haystack"

I wrote

def find_needle(n,h):
    count = 0
    words = h.split(" ")
    for word in words:
        if word == n:
            count += 1
    return count

This is O(n), but I am wondering if there is a better approach, maybe one that doesn't use split at all?

How would you write tests for this case to check that it handles all edge cases?

user299709
    Every solution will be `O(n)` because you have to search the entire string. Though you can still improve performance by removing allocations, etc. – Colonel Thirty Two Apr 22 '15 at 23:39
  • @ColonelThirtyTwo how can I remove such allocations? the memory space is important here – user299709 Apr 22 '15 at 23:44
  • The only way you could do better is if your words were sorted – Padraic Cunningham Apr 22 '15 at 23:49
  • The only way I can think of to do better is to accept a stream as your "h" parameter instead of a string. For that solution, you would parse the incoming stream for needle and discard any part of the stream you have read. Your time performance is still O(n) but your memory consumption would be a lot less. – I-Lin Kuo Apr 28 '15 at 18:16

9 Answers

12

I don't think it's possible to get below O(n) with this (because you need to iterate through the string at least once), but you can do some optimizations.

I assume you want to match "whole words", for example looking up foo should match like this:

foo and foo, or foobar and not foo.
^^^     ^^^                    ^^^

So splitting on spaces alone wouldn't do the job, because:

>>> 'foo and foo, or foobar and not foo.'.split(' ')
['foo', 'and', 'foo,', 'or', 'foobar', 'and', 'not', 'foo.']
#                  ^                                     ^

This is where the re module comes in handy, as it allows you to build exactly these kinds of conditions. For example, \b inside a regexp means:

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

So r'\bfoo\b' will match only the whole word foo. Also, don't forget to use re.escape() in case the needle contains regex metacharacters:

>>> re.escape('foo.bar+')
'foo\\.bar\\+'
>>> r'\b{}\b'.format(re.escape('foo.bar+'))
'\\bfoo\\.bar\\+\\b'

All you have to do now is use re.finditer() to scan the string. From the documentation:

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.

I assume that matches are generated on the fly, so they never all have to be in memory at once (which may come in handy with large strings containing many matches). And in the end, just count them:

>>> r = re.compile(r'\bfoo\b')
>>> it = r.finditer('foo and foo, or foobar and not foo.')
>>> sum(1 for _ in it)
3
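Putting the escaping and counting together, the whole answer condenses to a function like this (a sketch; the signature mirrors the OP's `find_needle(n, h)`):

```python
import re

def find_needle(n, h):
    # escape the needle so metacharacters are matched literally,
    # and anchor it between word boundaries
    pattern = re.compile(r'\b{}\b'.format(re.escape(n)))
    return sum(1 for _ in pattern.finditer(h))
```

For the question's example this returns 1, since thisneedle has no word boundary before needle.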
Vyktor
5

This does not address the complexity issue but simplifies the code:

def find_needle(n,h):
    return h.split().count(n)
Jérôme
4

You can use Counter

from collections import Counter

def find_needle(n,h):
    return Counter(h.split())[n]

For example:

n = "portugal"
h = 'lobito programmer from portugal hello fromportugal portugal'

print(find_needle(n, h))

Output:

2

DEMO

Pedro Lobito
  • Your post looked spamish (like self-promotion) at first sight. On second thought, it is not. And it is kinda funny. I undid the down vote. Your 10k rep should have been a clue. Anyway, can you elaborate on the interest of `Counter` here as opposed to the `count()` function? I don't see any benefit and you don't mention it explicitly. – Jérôme Apr 23 '15 at 07:34
  • @jerome There's no big advantage compared to your solution, it's just a different approach and a way for the OP to get to know Counter, which may be useful for future problems. Thanks for removing the down-vote. – Pedro Lobito Apr 23 '15 at 07:53
  • Why not just `return Counter(h.split())[n]`? – Stefan Pochmann May 03 '15 at 18:19
  • @stefan I guess you can also do that to save 1 line of code. – Pedro Lobito May 03 '15 at 18:23
  • the posted DEMO isn't the same code - anyway, neither worked for me in python 3.6 aws lambda. [this solution did work for me](https://stackoverflow.com/questions/29810883/finding-needle-in-haystack-what-is-a-better-solution#answer-29811635) – WEBjuju Mar 23 '21 at 21:14
3

Actually, when you say O(n) you are glossing over the fact that after matching the first letter, you have to match the remaining ones as well (match n from needle against the sentence, then match e, then the next e...), so a naive scan is closer to O(n·m) in the worst case. You are essentially trying to replicate the functionality of grep, so you can look at the grep algorithm. You can do well by building a finite state machine. There are many links that can help you; for one, you could start from How does grep run so fast?
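For illustration, here is a minimal Boyer-Moore-Horspool search (my sketch of the family of algorithms grep draws on, not grep's actual code). On a mismatch it can skip up to len(needle) characters at once, which is where the sub-linear average-case behavior comes from. Note that it counts raw substring occurrences, not whole words:

```python
def horspool_count(needle, haystack):
    m, n = len(needle), len(haystack)
    if m == 0 or m > n:
        return 0
    # bad-character table: safe shift when the char under the window's last
    # position is c (chars absent from the needle allow a full m-length shift)
    shift = {c: m - 1 - i for i, c in enumerate(needle[:-1])}
    count = 0
    pos = 0
    while pos + m <= n:
        if haystack[pos:pos + m] == needle:
            count += 1
        pos += shift.get(haystack[pos + m - 1], m)
    return count
```

In the worst case this is still O(n·m), but on typical text it examines far fewer than n characters.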

VBB
  • Looks like grep is the way to go. Never realized it could achieve sub-linear time. Pretty cool. – Kyle G Apr 30 '15 at 17:47
  • See this question (http://stackoverflow.com/questions/1106112/improving-boyer-moore-string-search) for a discussion of the Boyer-Moore algorithm, which is how grep achieves its speed. – seaotternerd May 03 '15 at 19:53
2

This is still going to be O(n), but it uses the power of the re module and Python's generator expressions.

import re

def find_needle(n, h):
    # word boundaries; escape n in case it contains regex metacharacters
    g = re.finditer(r'\b%s\b' % re.escape(n), h)
    return sum(1 for _ in g)  # count the matches without building a list

It should use far less memory than .split for a relatively large 'haystack'.

Note that this is not exactly the same as the code in the OP because it will not only find 'needle' but also 'needle,' and 'needle.' It will not find 'needles' though.
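As for the testing part of the question, the edge cases listed above ('needle,', 'needle.', 'needles') translate fairly directly into a test suite. A sketch using the standard unittest module (the find_needle here is this answer's regex version, with re.escape added):

```python
import re
import unittest

def find_needle(n, h):
    # word-boundary match, with the needle escaped to be taken literally
    g = re.finditer(r'\b%s\b' % re.escape(n), h)
    return sum(1 for _ in g)

class FindNeedleTests(unittest.TestCase):
    def test_basic_match(self):
        h = 'there is a needle in this but not thisneedle haystack'
        self.assertEqual(find_needle('needle', h), 1)

    def test_punctuation_is_a_boundary(self):
        self.assertEqual(find_needle('needle', 'needle, needle. (needle)'), 3)

    def test_no_partial_word_matches(self):
        self.assertEqual(find_needle('needle', 'needles thisneedle needlework'), 0)

    def test_empty_haystack(self):
        self.assertEqual(find_needle('needle', ''), 0)
```

Run with python -m unittest from the module containing the tests.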

Kyle G
0

If you are concerned with the time it takes (as distinct from time complexity), multiprocess it. Basically, make n smaller. Here is an example that runs it in 2 processes.

from multiprocessing import Pool

def find(args):
    word, string = args
    return string.count(word)

def search_for_words(word, string):
    half = len(string) // 2
    # a Pool collects each worker's return value (a bare Process discards it);
    # note that an occurrence straddling the midpoint would be missed
    with Pool(2) as pool:
        return sum(pool.map(find, [(word, string[:half]),
                                   (word, string[half:])]))

If it's the O(n) you are worried about, then I'm not sure there is much you can do, unless it is possible to get the string into another data structure, like a set or something. (Putting it in that structure is also O(n), but you can save time if you are already iterating over the string somewhere else, and build the structure then: write once, read many.)
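The write-once-read-many idea could look like this (a sketch; I use a Counter rather than a set so repeated queries keep the OP's counting semantics): build the structure once in O(n), then each lookup is O(1) on average:

```python
from collections import Counter

class Haystack:
    """Pay the O(n) split-and-count cost once; answer many queries cheaply."""

    def __init__(self, text):
        self.counts = Counter(text.split())

    def find(self, needle):
        return self.counts[needle]  # Counter returns 0 for absent words
```

For example, Haystack('there is a needle in this but not thisneedle haystack').find('needle') gives 1.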

Rcynic
0

In order to guarantee finding a needle in a haystack, you need to examine each piece of hay until you find the needle. This is O(n) no matter what, a tight lower bound.

Mitchell Carroll
0
def find_needle(haystack):
    for position, item in enumerate(haystack):
        if item == 'needle':
            return 'found the needle at position ' + str(position)
Rain
  • Please try to address the entirety of the question - What is the time complexity of this/what makes this a *better* solution, and how would you write tests around this? – Jonathan Holland Oct 31 '17 at 06:28
0

Here is mine:

def find_needle(haystack, needle):
    return haystack.count(needle)

Here, we simply use the built-in count method to count the number of needles in the haystack.
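One caveat worth noting (my observation, not part of the original answer): str.count() counts substring occurrences, so on the question's own example it also counts the needle inside thisneedle, unlike the OP's word-based code:

```python
def find_needle(haystack, needle):
    return haystack.count(needle)

h = 'there is a needle in this but not thisneedle haystack'
print(find_needle(h, 'needle'))  # prints 2: the 'needle' in 'thisneedle' is counted too
```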

Anu