1

I am trying to find a string within a long text extracted from a PDF file, get the string's position in the text, and then return the 100 words before the string and the 100 words after it. The problem is that the extraction is not perfect, so I run into cases like this:

The query string is "test text"

The text may look like:

This is atest textwith a problem

As you can see, the word "test" is joined to the letter "a", and the word "text" is joined to the word "with".

So the only function that works for me is `__contains__`, which doesn't return the position of the word.

Any ideas on how to find all the occurrences of a word in such a text, along with their positions?

Thank you very much

The Maestro
  • Maybe this link helps: http://stackoverflow.com/questions/250271/python-regex-how-to-get-positions-of-matches – Humbalan Oct 12 '16 at 14:39
  • Have you tried `str.find`? A common pattern is to repeatedly scan the string, each time starting at the previous hit. By the way, what is your expected output for the example? The character index `9`, the word index `2` or the cleaned-up-words index `3`? – MisterMiyagi Oct 12 '16 at 14:42

4 Answers

4

You did not specify all your requirements, but this works for your current problem. The program prints 9 and 42, which are the start positions of the two occurrences of "test text".

import re
filt = re.compile("test text")

for match in filt.finditer('This is atest textwith a problem. another test text'):
    print(match.start())
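
If the extraction can also drop the space inside the query itself (for example "testtext"), one possible extension is to build the pattern from the query words and allow any amount of whitespace between them. This is only a sketch: `flexible_positions` and `sample` are hypothetical names, not part of the answer above.

```python
import re

def flexible_positions(text, query):
    """Return start indices of `query` in `text`, allowing the spaces
    inside the query to be missing or repeated (hypothetical helper)."""
    # Escape each query word, then allow zero or more whitespace characters between them.
    pattern = re.compile(r"\s*".join(re.escape(word) for word in query.split()))
    return [m.start() for m in pattern.finditer(text)]

sample = "This is atest textwith a problem. another testtext"
print(flexible_positions(sample, "test text"))  # [9, 42]
```

Escaping each word with `re.escape` keeps query characters such as `.` or `(` from being interpreted as regex syntax.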
Prophecies
3

You could take the following kind of approach. This first attempts to split the whole text into words, and keeps note of the index of each word.

Next, it iterates through the text looking for "test text" with zero or more whitespace characters between the two words. For each match, it notes the start and then builds lists of the words found before and after that point, using Python's bisect library to locate the required entries in the words list.

import bisect
import re

test = "aa bb cc dd test text ee ff gg testtextwith hh ii jj"

words = [(w.start(), w.group(0)) for w in re.finditer(r'(\b\w+?\b)', test)]

adjacent_words = 2

for match in re.finditer(r'(test\s*?text)', test):
    start, end = match.span()

    words_start = bisect.bisect_left(words, (start, ''))
    words_end = bisect.bisect_right(words, (end, ''))

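    # Clamp the lower bound so a match near the start of the text doesn't wrap around.
    first = max(words_start - adjacent_words, 0)
    words_before = [w for i, w in words[first : words_start]]
    words_after = [w for i, w in words[words_end : words_end + adjacent_words]]

    #  Adjacent words as a list
    print(words_before, match.group(0), words_after)

    # Or, surrounding text as is (fall back to the end of the text near the last words).
    last = words_end + adjacent_words
    stop = words[last][0] if last < len(words) else len(test)
    print(test[words[first][0] : stop])

    print()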

So for this example with 2 adjacent words, you would get the following output:

['cc', 'dd'] test text ['ee', 'ff']
cc dd test text ee ff 

['ff', 'gg'] testtext ['hh', 'ii']
ff gg testtextwith hh ii
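
A simpler, less efficient alternative to the bisect lookup is to split the text on either side of each match. The sketch below assumes a literal query; `matches_with_context`, `n_words`, and `sample` are hypothetical names introduced for illustration.

```python
import re

def matches_with_context(text, query, n_words=100):
    """Yield (start, words_before, words_after) for each literal match.

    Hypothetical helper: context is taken by splitting the text on either
    side of the match, so a word glued to the match ("atest") contributes
    only its leftover part ("a").
    """
    for m in re.finditer(re.escape(query), text):
        before = text[:m.start()].split()
        after = text[m.end():].split()
        # Slice from an explicit index so n_words=0 yields an empty list.
        yield m.start(), before[max(len(before) - n_words, 0):], after[:n_words]

sample = "This is atest textwith a problem"
for start, before, after in matches_with_context(sample, "test text", n_words=2):
    print(start, before, after)  # 9 ['is', 'a'] ['with', 'a']
```

This re-splits the text for every match, so for a long document with many hits the precomputed word list with bisect is likely the better choice.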
Martin Evans
2

If you're looking for the position of the text within the string, you can use `str.find()`.

>>> query_string = 'test text'
>>> text = 'This is atest textwith a problem'
>>> if query_string in text:
...     print(text.find(query_string))
...
9
pault
  • `find()` returns the index of the first instance (lowest index) in the search string, so you'd have to iteratively search the text multiple times to find all locations. – pault Oct 12 '16 at 14:43
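
The repeated search mentioned in the comments could look roughly like this. It is a minimal sketch: `find_all` is a hypothetical helper name, and it relies only on the optional start argument of `str.find`.

```python
def find_all(text, query):
    """Return the start index of every occurrence of `query` in `text`
    (hypothetical helper built on repeated str.find calls)."""
    positions = []
    start = text.find(query)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping hits are also reported.
        start = text.find(query, start + 1)
    return positions

print(find_all("This is atest textwith a problem. another test text", "test text"))  # [9, 42]
```

To skip overlapping hits instead, resume the search at `start + len(query)`.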
1

You might have a look at the regex module which allows for 'fuzzy' matching:

>>> import regex
>>> s='This is atest textwith a problem'
>>> regex.search(r'(?:text with){e<2}', s)
<regex.Match object; span=(14, 22), match='textwith', fuzzy_counts=(0, 0, 1)>
>>> regex.search(r'(?:test text){e<2}', s)
<regex.Match object; span=(8, 18), match='atest text', fuzzy_counts=(0, 1, 0)>

You can match text that has insertions, deletions, and substitutions. The match object returned carries the span of the match, and therefore its position.

You can use regex.findall to find all the potential target matches, or regex.finditer if you also need their positions.

Perfect for what you are describing.

dawg