
I need to find all positions of a word in a text (substring in string). My solution so far uses a regex, but I am not sure whether there are better approaches, perhaps built-in standard-library strategies. Any ideas?

import re

text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
links = {'fox': [], 'dog': []}
re_capture = u"(^|[^\w\-/])(%s)([^\w\-/]|$)" % "|".join(links.keys())

for match in re.finditer(re_capture, text):

    # fix the span using the context groups,
    # e.g. (' ', 'fox', ' ')
    m_groups = match.groups()
    start, end = match.span()
    start += len(m_groups[0])
    end -= len(m_groups[2])

    key = m_groups[1]
    links[key].append((start, end))

print(links)

{'fox': [(16, 19), (45, 48)], 'dog': [(40, 43)]}

Edit: Partial words must not match - note that the fox in Redfox is not in links.
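For comparison, a minimal `str.find` scan with no boundary checks also reports the fox inside Redfox, which is exactly what the context groups in the regex above are there to prevent:

```python
text = "The quick brown fox jumps over the lazy dog. fox. Redfox."

# collect every raw occurrence of "fox", including partial-word hits
positions = []
pos = text.find("fox")
while pos >= 0:
    positions.append(pos)
    pos = text.find("fox", pos + 1)

print(positions)  # [16, 45, 53] - 53 is the "fox" inside "Redfox"
```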

Thanks.

rebeling

2 Answers


Not as Pythonic, and without a regex:

text = "The quick brown fox jumps over the lazy dog. fox."
links = {'fox': [], 'dog': []}

for key in links:
    pos = 0
    while True:
        pos = text.find(key, pos)
        if pos < 0:
            break
        links[key].append((pos, pos + len(key)))
        pos += 1
print(links)
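As the comments below note, this matches partial words too. A sketch of how the find loop could be extended to reject them, by checking the characters on either side of each hit (the `is_boundary` helper is illustrative, mirroring the question's regex where word characters, `-` and `/` bind to the word):

```python
text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
links = {'fox': [], 'dog': []}

def is_boundary(ch):
    # anything except word chars, '-' and '/' separates words,
    # matching the question's [^\w\-/] character class
    return not (ch.isalnum() or ch in "_-/")

for key in links:
    pos = text.find(key)
    while pos >= 0:
        end = pos + len(key)
        left_ok = pos == 0 or is_boundary(text[pos - 1])
        right_ok = end == len(text) or is_boundary(text[end])
        if left_ok and right_ok:
            links[key].append((pos, end))
        pos = text.find(key, pos + 1)

print(links)  # {'fox': [(16, 19), (45, 48)], 'dog': [(40, 43)]}
```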
    I like your code, would you please edit to indent your entire set of code by four spaces? Also if you would change `for link in links` to `for key in links` to match normal dictionary handling, that would be great. – Alea Kootz Oct 02 '15 at 23:07
  • Partial words are not allowed to match - see Redfox. – rebeling Oct 02 '15 at 23:28
  • Your code does not work in my case - too many conditions for the match. Thanks for your effort. – rebeling Oct 03 '15 at 23:43

If you want to match actual words and your strings contain ASCII:

text = "fox The quick brown fox jumps over the fox! lazy dog. fox!."
links = {'fox': [], 'dog': []}

from string import punctuation
def yield_words(s,d):
    i = 0
    for ele in s.split(" "):
        tot = len(ele) + 1
        ele = ele.rstrip(punctuation)
        ln = len(ele)
        if ele in d:
            d[ele].append((i, ln + i))
        i += tot
    return d

Unlike the find solution, this won't match partial words, and it runs in O(n) time:

In [2]: text = "The quick brown fox jumps over the lazy dog. fox. Redfox."

In [3]: links = {'fox': [], 'dog': []}

In [4]: yield_words(text,links)
Out[4]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}

This is probably one case where a regex is a good approach; it can just be much simpler:

import re

def reg_iter(s, d):
    r = re.compile("|".join(r"\b{}\b".format(w) for w in d))
    for match in r.finditer(s):
        d[match.group()].append((match.start(), match.end()))
    return d

Output:

In [6]: links = {'fox': [], 'dog': []}

In [7]: text = "The quick brown fox jumps over the lazy dog. fox. Redfox."

In [8]: reg_iter(text, links)
Out[8]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}
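One caveat with building the pattern by joining the keys: if a key can contain regex metacharacters, `re.escape` keeps it literal. A small sketch (the keys here are made up for illustration):

```python
import re

# Without re.escape, the '.' in 'node.js' would match any character,
# so 'nodexjs' would wrongly be reported as a hit.
links = {'node.js': [], 'fox': []}
pattern = re.compile("|".join(r"\b{}\b".format(re.escape(w)) for w in links))

text = "fox likes node.js but not nodexjs."
matches = [(m.group(), m.span()) for m in pattern.finditer(text)]
print(matches)  # [('fox', (0, 3)), ('node.js', (10, 17))]
```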
Padraic Cunningham
  • So far, your answer is my preferred one - reg_iter is shorter, a bit faster, and it solves an edge case not even mentioned in my question: I am processing a lot of text with German umlauts, and your code just worked on that too. – rebeling Oct 03 '15 at 23:38
  • Rating and explanation will be added soon - maybe there will be something else put on the table we both never dreamed of, thanks for your answer ;) – rebeling Oct 03 '15 at 23:39
  • @rebeling, no worries, glad it helped – Padraic Cunningham Oct 03 '15 at 23:41