
I need to find all positions of a word in a text (substring in string). My solution so far uses a regex, but I am not sure whether there are better approaches, perhaps built-in standard-library strategies. Any ideas?

import re

text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
links = {'fox': [], 'dog': []}
re_capture = u"(^|[^\w\-/])(%s)([^\w\-/]|$)" % "|".join(links.keys())

for match in re.finditer(re_capture, text):

    # fix the span using the context groups,
    # e.g. (' ', 'fox', ' ')
    m_groups = match.groups()
    start, end = match.span()
    start += len(m_groups[0])
    end -= len(m_groups[2])

    key = m_groups[1]
    links[key].append((start, end))

print(links)

{'fox': [(16, 19), (45, 48)], 'dog': [(40, 43)]}

Edit: Partial words must not match - note that the fox in Redfox is not in links.
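For comparison, a minimal `str.find` scan with no boundary checks also reports the fox inside Redfox, which is exactly what the context groups in the regex above are there to prevent:

```python
text = "The quick brown fox jumps over the lazy dog. fox. Redfox."

# collect every raw occurrence of "fox", including partial-word hits
positions = []
pos = text.find("fox")
while pos >= 0:
    positions.append(pos)
    pos = text.find("fox", pos + 1)

print(positions)  # [16, 45, 53] - 53 is the "fox" inside "Redfox"
```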

Thanks.

rebeling

2 Answers


Not as Pythonic, and without a regex:

text = "The quick brown fox jumps over the lazy dog. fox."
links = {'fox': [], 'dog': []}

for key in links:
    pos = 0
    while True:
        pos = text.find(key, pos)
        if pos < 0:
            break
        links[key].append((pos, pos + len(key)))
        pos += 1
print(links)
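As the comments below note, this matches partial words too. A sketch of how the find loop could be extended to reject them, by checking the characters on either side of each hit (the `is_boundary` helper is illustrative, mirroring the question's regex where word characters, `-` and `/` bind to the word):

```python
text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
links = {'fox': [], 'dog': []}

def is_boundary(ch):
    # anything except word chars, '-' and '/' separates words,
    # matching the question's [^\w\-/] character class
    return not (ch.isalnum() or ch in "_-/")

for key in links:
    pos = text.find(key)
    while pos >= 0:
        end = pos + len(key)
        left_ok = pos == 0 or is_boundary(text[pos - 1])
        right_ok = end == len(text) or is_boundary(text[end])
        if left_ok and right_ok:
            links[key].append((pos, end))
        pos = text.find(key, pos + 1)

print(links)  # {'fox': [(16, 19), (45, 48)], 'dog': [(40, 43)]}
```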
    I like your code, would you please edit to indent your entire set of code by four spaces? Also if you would change `for link in links` to `for key in links` to match normal dictionary handling, that would be great. – Alea Kootz Oct 02 '15 at 23:07
  • Partial words are not allowed to match - see Redfox. – rebeling Oct 02 '15 at 23:28
  • Your code does not work in my case - too many conditions for the match. Thanks for your effort. – rebeling Oct 03 '15 at 23:43

If you want to match actual words and your strings contain ASCII:

text = "fox The quick brown fox jumps over the fox! lazy dog. fox!."
links = {'fox': [], 'dog': []}

from string import punctuation
def yield_words(s,d):
    i = 0
    for ele in s.split(" "):
        tot = len(ele) + 1
        ele = ele.rstrip(punctuation)
        ln = len(ele)
        if ele in d:
            d[ele].append((i, ln + i))
        i += tot
    return d

Unlike the find solution, this won't match partial words, and it runs in O(n) time:

In [2]: text = "The quick brown fox jumps over the lazy dog. fox. Redfox."

In [3]: links = {'fox': [], 'dog': []}

In [4]: yield_words(text,links)
Out[4]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}

This is probably one case where a regex is a good approach; it can just be much simpler:

import re

def reg_iter(s, d):
    r = re.compile("|".join(r"\b{}\b".format(w) for w in d))
    for match in r.finditer(s):
        d[match.group()].append((match.start(), match.end()))
    return d

Output:

In [6]: links = {'fox': [], 'dog': []}

In [7]: text = "The quick brown fox jumps over the lazy dog. fox. Redfox."

In [8]: reg_iter(text, links)
Out[8]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}
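One caveat with building the pattern by joining the keys: if a key can contain regex metacharacters, `re.escape` keeps it literal. A small sketch (the keys here are made up for illustration):

```python
import re

# Without re.escape, the '.' in 'node.js' would match any character,
# so 'nodexjs' would wrongly be reported as a hit.
links = {'node.js': [], 'fox': []}
pattern = re.compile("|".join(r"\b{}\b".format(re.escape(w)) for w in links))

text = "fox likes node.js but not nodexjs."
matches = [(m.group(), m.span()) for m in pattern.finditer(text)]
print(matches)  # [('fox', (0, 3)), ('node.js', (10, 17))]
```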
Padraic Cunningham
  • So far, your answer is my preferred one - reg_iter is shorter, a bit faster, and it solves an edge case not even mentioned in my question: I am processing a lot of text with German umlauts, and your code just worked on that too. – rebeling Oct 03 '15 at 23:38
  • Rating and explanation will be added soon - maybe there will be something else put on the table we both never dreamed of, thanks for your answer ;) – rebeling Oct 03 '15 at 23:39
  • @rebeling, no worries, glad it helped – Padraic Cunningham Oct 03 '15 at 23:41