Find all the occurrences of a string in an imperfect text

Question

I am trying to find a string within a long text extracted from a PDF file, get the string's position in the text, and then return the 100 words before the string and the 100 words after it. The problem is that the extraction is not perfect, so I run into cases like this:

The query string is "test text"

The text may look like:

This is atest textwith a problem

As you can see, the word "test" is joined to the letter "a", and the word "text" is joined to the word "with".

So the only function that works for me is `__contains__`, which doesn't return the position of the match.

Any ideas on how to find all the occurrences of a word in such a text, along with their positions?

Thank you very much

Maybe this link helps: http://stackoverflow.com/questions/250271/python-regex-how-to-get-positions-of-matches – Humbalan, Oct 12 '16 at 14:39
Have you tried `str.find`? A common pattern is to repeatedly scan the string, each time starting at the previous hit. By the way, what is your expected output for the example? The character index `9`, the word index `2` or the cleaned-up-words index `3`? — MisterMiyagi, Oct 12 '16 at 14:42

score 4 · Accepted Answer · answered Oct 12 '16 at 14:44

You did not specify all your requirements, but this works for your current problem. The program prints 9 and 42, which are the start indexes of the two occurrences of "test text".

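import re

# Plain substring pattern: no word boundaries, so glued words still match
filt = re.compile("test text")

for match in filt.finditer('This is atest textwith a problem. another test text'):
    # start() is the character index where this occurrence begins
    print(match.start())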
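
If the query is not known in advance, the same idea can be wrapped in a small helper. This is only a sketch and not part of the original answer: `find_positions` is a hypothetical name, and it assumes that whitespace inside the query may be missing or repeated in the extracted text, so each gap between query words is matched as optional whitespace.

```python
import re

def find_positions(query, text):
    """Return the start index of every occurrence of `query` in `text`.

    Assumption: whitespace between the query's words may be missing or
    repeated in the extracted text, so it is treated as optional.
    """
    # Escape each word so regex metacharacters in the query are literal
    parts = [re.escape(word) for word in query.split()]
    if not parts:
        return []  # an empty query has no meaningful position
    pattern = re.compile(r'\s*'.join(parts))
    return [m.start() for m in pattern.finditer(text)]

text = 'This is atest textwith a problem. another testtext'
print(find_positions('test text', text))  # [9, 42]
```

Because the pattern uses no word boundaries, it matches inside glued words such as "atest", which is what the imperfect extraction requires.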

Prophecies

It appears that the OP is only looking for *one* occurrence. – brianpck Oct 12 '16 at 14:46
The title, though, states "Find all the occurrences"? – Martin Evans Oct 12 '16 at 14:52
Thank you, this solution was perfect for my problem. Many thanks to you ^_^ – The Maestro Oct 13 '16 at 11:35

Martin Evans · Answer 2 · 2016-10-12T17:59:14.247

You could take the following kind of approach. It first splits the whole text into words and records the start index of each word.

Next, it iterates through the text looking for "test text" with zero or more whitespace characters between the two words. For each match it records the start and end, then builds lists of the words found before and after that point, using Python's bisect module to locate the required entries in the words list.

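import bisect
import re

test = "aa bb cc dd test text ee ff gg testtextwith hh ii jj"

# (start index, word) for every word in the text, ordered by position
words = [(w.start(), w.group(0)) for w in re.finditer(r'\b\w+\b', test)]

adjacent_words = 2

# \s* allows zero or more whitespace characters between the two words
for match in re.finditer(r'test\s*text', test):
    start, end = match.span()

    # Index of the first word starting at or after the match start / match end
    words_start = bisect.bisect_left(words, (start, ''))
    words_end = bisect.bisect_right(words, (end, ''))

    # Clamp so a match near the start or end of the text cannot
    # wrap around to a negative index or run past the word list
    first = max(words_start - adjacent_words, 0)
    last = words_end + adjacent_words

    words_before = [w for i, w in words[first:words_start]]
    words_after = [w for i, w in words[words_end:last]]

    # Adjacent words as a list
    print(words_before, match.group(0), words_after)

    # Or, surrounding text as is.
    text_end = words[last][0] if last < len(words) else len(test)
    print(test[words[first][0]:text_end])

    print()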

So for this example with 2 adjacent words, you would get the following output:

['cc', 'dd'] test text ['ee', 'ff']
cc dd test text ee ff 

['ff', 'gg'] testtext ['hh', 'ii']
ff gg testtextwith hh ii

Thanks!! I have learned many things from your answer. Very valuable :D – The Maestro, Oct 13 '16 at 11:34

score 2 · Answer 3 · answered Oct 12 '16 at 14:41

If you're looking for the position of the text within the string, you can use `str.find()`, which returns the lowest index at which the substring occurs, or -1 if it is not found.

>>> query_string = 'test text'
>>> text = 'This is atest textwith a problem'
>>> if query_string in text:
...     print(text.find(query_string))
...
9

pault

`find()` returns the index of the first instance (lowest index) in the search string, so you'd have to iteratively search the text multiple times to find all locations. – pault Oct 12 '16 at 14:43
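
The repeated scan described in the comments can be sketched as follows. `find_all` is a hypothetical helper name, not something from the standard library, and it does exact matching only, so an occurrence with the space missing ("testtext") would not be found.

```python
def find_all(text, query):
    """Yield the index of each occurrence of `query` in `text`."""
    if not query:
        return  # an empty query would match at every index
    start = text.find(query)
    while start != -1:
        yield start
        # Resume one character past the previous hit so that
        # overlapping occurrences are reported as well
        start = text.find(query, start + 1)

text = 'This is atest textwith a problem. another test text'
print(list(find_all(text, 'test text')))  # [9, 42]
```

Passing the previous index plus one as the second argument of `str.find` is what keeps the loop from returning the same hit forever.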

score 1 · Answer 4 · answered Oct 12 '16 at 14:57

You might have a look at the third-party regex module (a separate package from the standard-library re), which allows for 'fuzzy' matching:

>>> import regex
>>> s='This is atest textwith a problem'
>>> regex.search(r'(?:text with){e<2}', s)
<regex.Match object; span=(14, 22), match='textwith', fuzzy_counts=(0, 0, 1)>
>>> regex.search(r'(?:test text){e<2}', s)
<regex.Match object; span=(8, 18), match='atest text', fuzzy_counts=(0, 1, 0)>

You can match text that has insertions, deletions, and substitutions. The match object returned gives you the span (start and end index) of the match.

You can use regex.findall to find all the potential target matches, or regex.finditer if you also need their positions.

This fits what you are describing.
