I wrote a function findTokenOffset that finds the character offsets of each given word in a pre-tokenized text (the tokens may come from a simple whitespace split or from a specific tokenizer).
import re, json

def word_regex_ascii(word):
    return r"\b{}\b".format(re.escape(word))

def findTokenOffset(text, tokens):
    seen = {}   # maps a word to the last position at which it has been seen
    items = []  # resulting word tokens with offsets
    my_regex = word_regex_ascii
    # for each token word
    for index_word, word in enumerate(tokens):
        r = re.compile(my_regex(word), flags=re.I | re.X | re.UNICODE)
        item = {}
        # for each match of the token in the sentence
        for m in r.finditer(text):
            token = m.group()
            characterOffsetBegin = m.start()
            characterOffsetEnd = characterOffsetBegin + len(m.group()) - 1  # inclusive end, offsets start from 0
            found = -1
            if word in seen:
                found = seen[word]
            if characterOffsetBegin > found:
                # store the last position where this word has been seen
                seen[word] = characterOffsetEnd
                item['index'] = index_word + 1  # word index starts from 1
                item['word'] = token
                item['characterOffsetBegin'] = characterOffsetBegin
                item['characterOffsetEnd'] = characterOffsetEnd
                items.append(item)
                break
    return items
This code works fine when the tokens are single words:

text = "George Washington came to Washington"
tokens = text.split()
offsets = findTokenOffset(text, tokens)
print(json.dumps(offsets, indent=2))
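This prints every word with inclusive, 0-based character offsets; note how the duplicate "Washington" is correctly resolved to the second occurrence at position 26 by the seen map:

[
  {
    "index": 1,
    "word": "George",
    "characterOffsetBegin": 0,
    "characterOffsetEnd": 5
  },
  {
    "index": 2,
    "word": "Washington",
    "characterOffsetBegin": 7,
    "characterOffsetEnd": 16
  },
  {
    "index": 3,
    "word": "came",
    "characterOffsetBegin": 18,
    "characterOffsetEnd": 21
  },
  {
    "index": 4,
    "word": "to",
    "characterOffsetBegin": 23,
    "characterOffsetEnd": 24
  },
  {
    "index": 5,
    "word": "Washington",
    "characterOffsetBegin": 26,
    "characterOffsetEnd": 35
  }
]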
But suppose the tokens are multi-word strings that can overlap each other, as here:

text = "George Washington came to Washington"
tokens = ["George Washington", "Washington"]
offsets = findTokenOffset(text, tokens)
print(json.dumps(offsets, indent=2))
Now the offsets come out wrong, because the same word repeats across different tokens:
[
  {
    "index": 1,
    "word": "George Washington",
    "characterOffsetBegin": 0,
    "characterOffsetEnd": 16
  },
  {
    "index": 2,
    "word": "Washington",
    "characterOffsetBegin": 7,
    "characterOffsetEnd": 16
  }
]
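What I would expect for the second token is the standalone occurrence at the end of the sentence, i.e.:

  {
    "index": 2,
    "word": "Washington",
    "characterOffsetBegin": 26,
    "characterOffsetEnd": 35
  }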
How can I add support for multi-token and overlapping token regex matching (thanks to the suggestion in the comments for this problem's exact name)?
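To make the intended behaviour concrete, here is a rough sketch of the simple case where the tokens are assumed to appear in the order they occur in the text (find_token_offsets_ordered is just an illustrative name, not working code I already have). It keeps a single search cursor and only accepts a match that starts at or after it, so the inner "Washington" already consumed by "George Washington" is skipped:

import re

def find_token_offsets_ordered(text, tokens):
    # Sketch only: assumes each token occurs in the text, in the given order.
    items = []
    cursor = 0  # everything before this position is already consumed
    for index_word, word in enumerate(tokens, start=1):
        pattern = re.compile(r"\b{}\b".format(re.escape(word)), flags=re.I | re.UNICODE)
        m = pattern.search(text, cursor)  # search only from the cursor onwards
        if not m:
            continue  # token not found after the current position
        items.append({
            'index': index_word,
            'word': m.group(),
            'characterOffsetBegin': m.start(),
            'characterOffsetEnd': m.end() - 1,  # inclusive end, as above
        })
        cursor = m.end()  # the next token must start after this match
    return items

On the example above this yields offsets 0 to 16 for "George Washington" and 26 to 35 for "Washington", but it clearly breaks when a later token legitimately starts before (or inside) an earlier one, which is the genuinely overlapping case I do not know how to handle.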