I have an array of tokens to map, and a regex that finds the begin and end positions of each token within an input sentence. This works fine when a token occurs only once. When a token occurs multiple times, the global regex matches every position of that token in the text, and each match overwrites the previous offsets, so every occurrence of the token ends up mapped to the last position found.
For example, given the text
var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
the first occurrence of the token down is mapped to the last position in the text matched by the RegExp, hence I have:
{
  "index": 2,
  "word": "down",
  "characterOffsetBegin": 70,
  "characterOffsetEnd": 73
}
This becomes clear when running this example:
var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
var tokens = text.split(/\s+/g);
var annotations = tokens.map((word, tokenIndex) => { // for each token
  let item = {
    "index": (tokenIndex + 1),
    "word": word
  };
  // global regex: exec() walks through every match of the word in the whole text
  var wordRegex = RegExp("\\b(" + word + ")\\b", "g");
  var match = null;
  while ((match = wordRegex.exec(text)) !== null) {
    var wordStart = match.index;
    var wordEnd = wordStart + word.length - 1; // inclusive end offset
    // each match overwrites the previous offsets, so the last match wins
    item.characterOffsetBegin = wordStart;
    item.characterOffsetEnd = wordEnd;
  }
  return item;
});
console.log(annotations);
whereas the first occurrence of the token down should be mapped to the first matching position:
{
  "index": 2,
  "word": "down",
  "characterOffsetBegin": 6,
  "characterOffsetEnd": 9
}
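To make the expected mapping concrete, here is a minimal sketch of one possible approach, not necessarily the best one: since the tokens come from splitting the text on whitespace, they appear in the text in order, so each token can be searched for starting just past the previous one. The helper name annotateTokens and the cursor variable are my own, made up for illustration, and the sketch swaps the regex for String.prototype.indexOf.

```javascript
// Hypothetical helper (name is illustrative): maps the i-th occurrence of a
// token to the i-th position in the text by scanning left to right.
function annotateTokens(text) {
  // Assumption: tokens are whitespace-separated, as in the example above.
  const tokens = text.split(/\s+/).filter(Boolean);
  let cursor = 0; // index just past the previously located token
  return tokens.map((word, tokenIndex) => {
    // Search only from the cursor onward, so earlier occurrences are skipped.
    const begin = text.indexOf(word, cursor);
    const end = begin + word.length - 1; // inclusive end offset
    cursor = end + 1;
    return {
      index: tokenIndex + 1,
      word: word,
      characterOffsetBegin: begin,
      characterOffsetEnd: end
    };
  });
}

const text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
const annotations = annotateTokens(text);
console.log(annotations[1]); // the first "down": offsets 6 to 9
```

With this scan the four occurrences of down get the begin offsets 6, 24, 40 and 70 respectively, instead of all sharing 70.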
Once each occurrence of a token is mapped to its own position in the text (the first occurrence of down to the first match, the second to the second match, and so on), I can reconstruct the text from characterOffsetBegin and characterOffsetEnd, like this:
var newtext = '';
results.sentences.forEach(sentence => {
  sentence.tokens.forEach(token => {
    // characterOffsetEnd is inclusive, hence the + 1
    newtext += text.substring(token.characterOffsetBegin, token.characterOffsetEnd + 1) + ' ';
  });
  newtext += '\n';
});
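For reference, a self-contained sketch of that reconstruction step. The shape of results (a sentences array whose entries each hold a tokens array) is an assumption inferred from the snippet above, and the offsets are hard-coded for a few tokens only, purely for illustration. It joins the slices instead of concatenating, which avoids the trailing space and trailing newline.

```javascript
const text = "Steve down walks warily down the street down\nWith the brim pulled way down low";

// Assumed shape of `results`; offsets are inclusive on both ends and are
// hard-coded here for illustration only.
const results = {
  sentences: [
    { tokens: [
      { word: "Steve", characterOffsetBegin: 0, characterOffsetEnd: 4 },
      { word: "down", characterOffsetBegin: 6, characterOffsetEnd: 9 }
    ] },
    { tokens: [
      { word: "With", characterOffsetBegin: 45, characterOffsetEnd: 48 },
      { word: "the", characterOffsetBegin: 50, characterOffsetEnd: 52 }
    ] }
  ]
};

// Slice each token out of the original text, then join tokens with spaces
// and sentences with newlines.
const newtext = results.sentences
  .map(sentence => sentence.tokens
    .map(token => text.substring(token.characterOffsetBegin, token.characterOffsetEnd + 1))
    .join(' '))
  .join('\n');

console.log(newtext);
```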