1

I have an array of tokens to map, and a regex that gets the begin and end positions of each token within an input sentence. This works ok when the token has one occurrence. When the token has multiple occurrences, the greedy Regex will get all the matched positions of the token in the text, so the resulting position for the i-th token occurrence will be mapped by the last found position.

By example, given the text

var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";

the first occurrence of the token down is mapped to the last position in the text matched by the RegExp, hence I have:

 {
    "index": 2,
    "word": "down",
    "characterOffsetBegin": 70,
    "characterOffsetEnd": 73
  }

This becomes clear running this example:

var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
var tokens = text.split(/\s+/g);
var annotations = tokens.map((word, tokenIndex) => { // for each token
  let item = {
    "index": (tokenIndex + 1),
    "word": word
  }
  var wordRegex = RegExp("\\b(" + word + ")\\b", "g");
  var match = null;
  while ((match = wordRegex.exec(text)) !== null) {
    var wordStart = match.index;
    var wordEnd = wordStart + word.length - 1;
    item.characterOffsetBegin = wordStart;
    item.characterOffsetEnd = wordEnd;
  }
  return item;
});
console.log(annotations)

where the first occurrence of the token down should be the first matching position:

 {
    "index": 2,
    "word": "down",
    "characterOffsetBegin": 6,
    "characterOffsetEnd": 9
  }

So given that I have mapped the tokens position for each occurrence of the token in the text i.e. first occurrence of down with the first match, the 2nd with the second match etc. I can reconstruct the text accordingly with the charOffsetBegin and charOffsetEnd hence doing like:

                var newtext = '';
                results.sentences.forEach(sentence => {
                    sentence.tokens.forEach(token => {
                        newtext += text.substring(token.characterOffsetBegin, token.characterOffsetEnd + 1) + ' ';
                    });
                    newtext += '\n';
                });
loretoparisi
  • 15,724
  • 11
  • 102
  • 146
  • 1
    [Looks like it's not straight-forward](https://stackoverflow.com/q/1985594/463206) – radarbob Nov 21 '18 at 00:33
  • The problem is not the expression is greedy, the problem is that you are looking for every match via the `while` loop, so the last one wins. Do you only want to search for the first match? One solution would be to keep track of the previous match position for a token, and ditch the `while` loop. – Felix Kling Nov 21 '18 at 00:33
  • @FelixKling yes this make senses, but I did not find a way to do it. The previous match also should be by token, i.e. it should be like a token map of matches like `tokenMap[ matchIndex ]`, but I'm not sure of it. – loretoparisi Nov 21 '18 at 00:42
  • I had posted an answer, but it's actually unclear to me what the desired result really is. How I understand it: You want to find the position of the occurrence of a token. Your example is a bit confusing though because not only does a token occur multiple times in the input text, you also have duplicate tokens in your token list! So, if tokens are unique, what should the result be if it occurs multiple times in the input text? If tokens are not unique, what should the result be if duplicate tokens only occur once in the input text? And finally, what if duplicate tokens occur multiple times? – Felix Kling Nov 21 '18 at 00:49
  • 1
    The way you describe it it seems like the result should be: Find the position for first occurrence in the input text. If there are duplicate tokens, the second (,third, forth,...) "copy" of the token should find the second, (,third, forth, ...) occurrence in the input text. Is that correct? – Felix Kling Nov 21 '18 at 00:51
  • @FelixKling you assumption was right. A token like `down` maybe present one or more times in the sentence i.e. in the array, so your solution looked good to me using the `Map` at the first instance (I was trying it as you deleted it :D – loretoparisi Nov 21 '18 at 00:52
  • Ok then :D .... – Felix Kling Nov 21 '18 at 00:53
  • ...yes because basically I can the reconstruct the input text from these start and end positions with the right indexes and I can put overlay on a token in its position. Adding an example. – loretoparisi Nov 21 '18 at 00:54

2 Answers2

2

The problem is not that the expression is greedy, but that you are looking for every match of a token inside the input string with your while loop.

You have to do two things:

  • Stop iterating once you found a match.
  • Keep track of previous matches so that you can ignore them.

I believe this is what you want:

var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
var tokens = text.split(/\s+/g);
const seen = new Map();

var annotations = tokens.map((word, tokenIndex) => { // for each token
  let item = {
    "index": (tokenIndex + 1),
    "word": word
  }
  var wordRegex = RegExp("\\b(" + word + ")\\b", "g");
  var match = null;
  while ((match = wordRegex.exec(text)) !== null) {
    if (match.index > (seen.get(word) || -1)) {
      var wordStart = match.index;
      var wordEnd = wordStart + word.length - 1;
      item.characterOffsetBegin = wordStart;
      item.characterOffsetEnd = wordEnd;

      seen.set(word, wordEnd);
      break;
    }
  }
  return item;
});
console.log(annotations)

The seen map keeps track of the end position of the most recent match for a token.

Since it isn't possible to tell the regex engine to ignore everything before a specific position, we are still using a while loop, but are ignoring any matches that happen before the previous match, with if (match.index > (seen.get(word) || -1)).

Felix Kling
  • 795,719
  • 175
  • 1,089
  • 1,143
1

@Felix's answer covers the cause of your problem, but I'd like to take it a bit further.

I would put everything in a class (or a constructor) to keep it contained, and separate the logic for extracting the matches from the text for each token from the iteration of the tokens.

class Annotations {
  constructor(text) {
    if(typeof text !== 'string') return null
    const opt = { enumerable: false, configurable: false, writeable: false }
    Object.defineProperty(this, 'text', { value: text, ...opt })
    Object.defineProperty(this, 'tokens', { value: text.split(/\s+/g), ...opt })
    for(let token of this.tokens) this[token] = Array.from(this.matchAll(token))
  }
  * matchAll(token) {
    if(typeof token === 'string' && this.text.indexOf(token) > -1) {
      const expression = new RegExp("\\b" + token + "\\b", "g")
      let match = expression.exec(this.text)

      while(match !== null) {
        const start = match.index
        const end = start + token.length - 1
        yield { start, end }
        match = expression.exec(this.text)
      }
    }
  }
}

const annotations = new Annotations("Steve down walks warily down the street down\nWith the brim pulled way down low")

console.log(annotations.text)
console.log(annotations.tokens)
console.log(annotations)
console.log(Array.from(annotations.matchAll('foo'))) // []
.as-console-wrapper { max-height: 100% !important }
  • This makes a lot of sense as a library utility, thank you! Considered the thread mentioned in the first comment above about how this issue could be tricky, I would recommend to add your version to std.js too! – loretoparisi Nov 21 '18 at 08:15