I wrote this function `findTokenOffset` that finds the offset of a given word in a pre-tokenized text (either a list of whitespace-split words or the output of some tokenizer).

import re, json

def word_regex_ascii(word):
    return r"\b{}\b".format(re.escape(word))

def findTokenOffset(text, tokens):
    seen = {}   # last end offset at which each token has been seen
    items = []  # matched word tokens
    my_regex = word_regex_ascii
    # for each token word
    for index_word, word in enumerate(tokens):
        r = re.compile(my_regex(word), flags=re.I | re.X | re.UNICODE)
        item = {}
        # for each match of the token in the sentence
        for m in r.finditer(text):
            token = m.group()
            characterOffsetBegin = m.start()
            characterOffsetEnd = characterOffsetBegin + len(m.group()) - 1  # offsets start from 0
            found = -1
            if word in seen:
                found = seen[word]
            if characterOffsetBegin > found:
                # remember where this word was last seen
                seen[word] = characterOffsetEnd
                item['index'] = index_word + 1  # word index starts from 1
                item['word'] = token
                item['characterOffsetBegin'] = characterOffsetBegin
                item['characterOffsetEnd'] = characterOffsetEnd
                items.append(item)
                break
    return items

This code works fine when the tokens are single words, like:

text = "George Washington came to Washington"
tokens = text.split()
offsets = findTokenOffset(text,tokens)
print(json.dumps(offsets, indent=2)) 

But suppose the tokens are multi-word phrases, like here:

text = "George Washington came to Washington"
tokens = ["George Washington", "Washington"]
offsets = findTokenOffset(text,tokens)
print(json.dumps(offsets, indent=2)) 

the offsets do not work properly, because words repeat across different tokens:

[
  {
    "index": 1,
    "word": "George Washington",
    "characterOffsetBegin": 0,
    "characterOffsetEnd": 16
  },
  {
    "index": 2,
    "word": "Washington",
    "characterOffsetBegin": 7,
    "characterOffsetEnd": 16
  }
]

How can I add support for multi-token and overlapping token regex matching (thanks to the suggestion in the comments for this exact problem's name)?

loretoparisi
  • What is your expected output? Is this related to overlapped regex matching? – Aaron Apr 01 '21 at 15:49
  • Do that with a single pass. How large is your vocabulary (number of `tokens`)? – Wiktor Stribiżew Apr 01 '21 at 15:50
  • Hey @Aaron, yes, it's "token overlapping" regex matching, correct. – loretoparisi Apr 01 '21 at 16:09
  • @WiktorStribiżew Hi Wiktor, I do not have a specific number; let's say it's a whole document, like a Wikipedia page, so it could be huge or not. Your single-pass approach is interesting. I was thinking of something like that, a sort of vectorized algorithm, but did not figure out how. – loretoparisi Apr 01 '21 at 16:10
  • This approach may become very messy if you need to have the token indices in the resulting JSON, see https://ideone.com/jMEtRb. It looks fine, but you have no pretty way to get the information on what token matched. – Wiktor Stribiżew Apr 01 '21 at 16:50
  • @WiktorStribiżew thanks. This approach works OK in most cases, but it fails when having e.g. `text = "George Washington came to Washington Washington.com"` and `tokens = ["George Washington", "Washington", "Washington.com"]`; this will result in matching `Washington` within `Washington.com` – loretoparisi Apr 02 '21 at 13:50
  • I do not see what fails [here](https://ideone.com/6HHFga). – Wiktor Stribiżew Apr 02 '21 at 13:52
  • @WiktorStribiżew oops, sorry, that works. I meant this sequence: `tokens = ["George Washington", "Washington"]`. In this case the second token will match twice: the whole word and the token within `Washington.com`. – loretoparisi Apr 02 '21 at 14:06
  • @loretoparisi But what is the token boundary then? I thought it was a word boundary. – Wiktor Stribiżew Apr 02 '21 at 14:14
  • Ah, I see. It should not match the word `Washington` when inside `Washington.com`, so it should respect the boundary, yes, to match only when it's alone, like `\b$1\b` – loretoparisi Apr 02 '21 at 14:30
  • What is the token boundary? `Washington` in `Washington.com` is a whole word, as there is a word boundary between `n` and `.`. It is a valid match given the requirements. If you want to fail the match if there is a `.` right after a word and there is a word char on the right, you can use `re.compile(fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})\b(?!\.\b)', re.I )`, see [demo](https://ideone.com/A74a9M). – Wiktor Stribiżew Apr 02 '21 at 15:59
  • Please let me know if that is something worth posting; I will post it as an answer with explanations. – Wiktor Stribiżew Apr 02 '21 at 18:55

2 Answers


If you do not need the search phrase/word index information in the resulting output, you can use the following approach (note that `characterOffsetEnd` here is the exclusive `m.end()` offset, unlike the inclusive end offset in your code):

import re, json
 
def findTokenOffset(text, pattern):
    items = []
    # each match of the combined pattern becomes one result item
    for m in pattern.finditer(text):
        item = {}
        item['word'] = m.group()
        item['characterOffsetBegin'] = m.start()
        item['characterOffsetEnd'] = m.end()
        items.append(item)
    return items
 
text = "George Washington came to Washington Washington.com"
tokens = ["George Washington", "Washington"]
pattern = re.compile(fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\w)(?!\.\b)', re.I)
offsets = findTokenOffset(text,pattern)
print(json.dumps(offsets, indent=2)) 

The output of the Python demo:

[
  {
    "word": "George Washington",
    "characterOffsetBegin": 0,
    "characterOffsetEnd": 17
  },
  {
    "word": "Washington",
    "characterOffsetBegin": 26,
    "characterOffsetEnd": 36
  }
]

The main part is `pattern = re.compile(fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\w)(?!\.\b)', re.I)`, which does the following:

  • `map(re.escape, tokens)` - escapes special chars inside the token strings
  • `sorted(..., key=len, reverse=True)` - sorts the escaped tokens by length in descending order (so that `Washington Post` can match before `Washington`)
  • `"|".join(...)` - creates an alternation of the tokens, `token1|token2|etc.`
  • `(?<!\w)(?:...)(?!\w)(?!\.\b)` - the final pattern that matches all the alternatives in `tokens` as whole words. `(?<!\w)` and `(?!\w)` are used to enable word-boundary detection even if the tokens start/end with a special character (see the short demo after this list).
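
For illustration, here is what the pattern-building expression produces for the tokens above (a quick demo; the exact escaping in the printed output assumes Python 3.7+ `re.escape` behavior):

import re

tokens = ["George Washington", "Washington"]
# Build the alternation, longest token first, then wrap it with the boundary checks
pattern = fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\w)(?!\.\b)'
print(pattern)
# Prints: (?<!\w)(?:George\ Washington|Washington)(?!\w)(?!\.\b)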

NOTE ON WORD BOUNDARIES

You should check your token boundary requirements. I added `(?!\.\b)` as you mention that `Washington` should not match in `Washington.com`, so I inferred you want to fail any word match that is immediately followed with `.` and a word char on the right. There are a lot of other possible solutions, the main one being whitespace boundaries, `(?<!\S)` and `(?!\S)`.
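
For example, a minimal sketch of the whitespace-boundary variant (same pattern builder, different boundary checks; it reuses the `findTokenOffset` function above, and `Washington` no longer matches inside `Washington.com` because a non-whitespace `.` follows it):

import re, json

text = "George Washington came to Washington Washington.com"
tokens = ["George Washington", "Washington"]
# (?<!\S) / (?!\S): the match must be delimited by whitespace or the string edges
pattern_ws = re.compile(
    fr'(?<!\S)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\S)',
    re.I)
print(json.dumps(findTokenOffset(text, pattern_ws), indent=2))
# Matches "George Washington" (0-17) and the standalone "Washington" (26-36)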

Besides, see Match a whole word in a string using dynamic regex.
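
If you do need the token index information mentioned at the start of this answer, one possible extension (a sketch, assuming the tokens are unique when compared case-insensitively) is to map each matched string back to its 1-based position in `tokens`:

import re, json

def findTokenOffsetWithIndex(text, tokens, pattern):
    # Hypothetical helper: look up each match's 1-based position in `tokens`
    index_of = {t.lower(): i + 1 for i, t in enumerate(tokens)}
    items = []
    for m in pattern.finditer(text):
        items.append({
            'index': index_of.get(m.group().lower(), -1),
            'word': m.group(),
            'characterOffsetBegin': m.start(),
            'characterOffsetEnd': m.end()
        })
    return items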

Wiktor Stribiżew

If you want to look up `Washington`, but not `George Washington`, you can remove the phrases you have already found from the initial string. So, you can sort the `tokens` by word count; that gives you the opportunity to scan the multi-word phrases first, and the single words after that (a sketch of this idea follows).
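
A minimal sketch of that idea (blanking each match with same-length filler is one way to do the "removal" while keeping the offsets of later matches valid):

import re

def find_longest_first(text, tokens):
    items = []
    remaining = text
    # Scan longer phrases first: by word count, then by character length
    for token in sorted(tokens, key=lambda t: (len(t.split()), len(t)), reverse=True):
        token_re = re.compile(rf'\b{re.escape(token)}\b')
        for m in token_re.finditer(remaining):
            items.append({'word': token,
                          'characterOffsetBegin': m.start(),
                          'characterOffsetEnd': m.end()})
        # Blank out the matches with same-length filler so shorter tokens
        # cannot re-match inside them and later offsets are unchanged
        remaining = token_re.sub(lambda m: '\x00' * len(m.group()), remaining)
    return sorted(items, key=lambda d: d['characterOffsetBegin'])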

Dan