JavaScript: break text into words, but also store each word's start and end index

Question

I am trying to build an array of every word in a text, where each entry looks like [word, startIndex, endIndex]. I will use it to replace words later, after checking each word's type and finding a synonym for it. The problem is splitting the text into words while keeping each word's start and end index. text.match(/\b(\w+)\b/g) finds the words, but it does not give me the start and end indices I need. I also tried writing my own function to parse the text, but it ended up overcomplicated and did not work as it should.

So I wondered whether anybody in the JavaScript community here has a better solution or knows how to write a simple function for this.

This is what I would like to happen.

Input:

Norway, officially the Kingdom of Norway, is a sovereign state and unitary monarchy whose territory comprises the western portion of the Scandinavian Peninsula

Output:

['Norway', 0, 6], ['officially', 8, 18]

And the same for all words

You have to show the code you have used. We avoid just asking for suggestions at Stack Overflow — Ruan Mendes, Mar 21 '18 at 15:27
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/exec — Keith, Mar 21 '18 at 15:28
Can you clarify why you want the last index of Norway to be 6, since strings are 0 indexed, so it would really be 5? — user3483203, Mar 21 '18 at 15:39

score 1 · Accepted Answer · answered Mar 21 '18 at 15:34

Partly taken from "Return positions of a regex match() in Javascript?", but adapted to return the match itself together with its start index and its inclusive end index:

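var wordIndices = (s) => {
  var getAllWords = /\b(\w+)\b/g;
  var output = [];
  var match; // declare match so it does not become an implicit global
  // With the g flag, exec() resumes from lastIndex, so each call yields the next word
  while ((match = getAllWords.exec(s)) != null) {
    // [word, start index, inclusive end index]
    output.push([match[0], match.index, match.index + match[0].length - 1]);
  }
  return output;
};

var s = 'Norway, officially the Kingdom of Norway, is a sovereign state and unitary monarchy whose territory comprises the western portion of the Scandinavian Peninsula';

console.log(wordIndices(s));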
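
If you would rather get the end index the way the question's expected output shows it (['Norway', 0, 6], i.e. pointing one past the last character), a minimal sketch with String.prototype.matchAll could look like the following. The name wordSpans is my own, the exclusive end convention is an assumption about what the asker wants, and matchAll needs an ES2020 environment. The plain /\w+/g pattern finds the same words as /\b(\w+)\b/g here, since a greedy \w+ always covers a whole run of word characters.

```javascript
// Hypothetical helper: returns [word, start, endExclusive] for each word.
// Assumes String.prototype.matchAll is available (ES2020).
function wordSpans(text) {
  const spans = [];
  for (const match of text.matchAll(/\w+/g)) {
    const start = match.index;
    // The end index is exclusive, so text.slice(start, end) returns the word.
    const end = start + match[0].length;
    spans.push([match[0], start, end]);
  }
  return spans;
}

const sample = 'Norway, officially the Kingdom of Norway';
// The first two entries are ['Norway', 0, 6] and ['officially', 8, 18].
console.log(wordSpans(sample).slice(0, 2));
```

An exclusive end index has the small advantage that it can be passed straight to slice() or substring() without adding 1.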

Keith · Answer 2 · 2018-03-21T15:39:34.900

1

I think your example results were slightly wrong: ['Norway', 0, 6], ['officially', 9, 19]. The last one should have been 8, 18.

So the following might be what you're after.

var str1 = `Norway, officially the Kingdom of Norway, is a sovereign state and unitary monarchy whose territory comprises the western portion of the Scandinavian Peninsula`;

// The g flag makes exec() continue from where the previous match ended
var regex1 = /\b(\w+)\b/g;
var array1;
var ret = [];

while ((array1 = regex1.exec(str1)) !== null) {
  // [word, start index, inclusive end index]
  ret.push([array1[0], array1.index,
    array1.index + array1[0].length - 1]);
}

console.log(ret);
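
Since the goal is to replace words afterwards, keep in mind that replacing a word with one of a different length shifts every index after it. One way around that is to apply the replacements from the end of the string backwards. This is only a sketch under my own assumptions: the helper name applyReplacements and the [start, endInclusive, replacement] tuple shape are made up for illustration, and the edits are assumed not to overlap.

```javascript
// Hypothetical helper: edits is an array of [start, endInclusive, replacement],
// using the same inclusive end index as the code above.
// Assumes the edited ranges do not overlap.
function applyReplacements(text, edits) {
  // Process the highest start index first so earlier indices stay valid.
  const ordered = [...edits].sort((a, b) => b[0] - a[0]);
  let result = text;
  for (const [start, endInclusive, replacement] of ordered) {
    result = result.slice(0, start) + replacement + result.slice(endInclusive + 1);
  }
  return result;
}

console.log(applyReplacements('Norway is a sovereign state', [
  [0, 5, 'Denmark'],
  [12, 20, 'self-governing'],
]));
// Denmark is a self-governing state
```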

edited Mar 21 '18 at 15:39

answered Mar 21 '18 at 15:36

Keith


It depends on how he wants the result and what he means by lastIndex, but I've just done a quick mod to do it the way we think of it. – Keith Mar 21 '18 at 15:39

score 0 · Answer 3 · answered Mar 21 '18 at 15:37

If your goal is to replace those words, there is an easier solution. You can just use replace with a callback function.

Example:

const input = 'Norway, officially the Kingdom of Norway, is a sovereign state and unitary monarchy whose territory comprises the western portion of the Scandinavian Peninsula'


const output = input.replace(/\b(\w+)\b/g, (word, group, index) => {
    console.log(word, index);

    if (word.length <= 3) {
        return '...';
    } else {
        return word;
    }
})

console.log(output);
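
For the synonym use case from the question, the callback could look each word up in a table. A minimal sketch, where the synonyms map, its entries, and the function name replaceWithSynonyms are made-up placeholders (the real lookup, including the word-type check, would be your own):

```javascript
// Hypothetical lookup table; the entries are placeholders for illustration.
const synonyms = new Map([
  ['territory', 'land'],
  ['portion', 'part'],
]);

function replaceWithSynonyms(text) {
  return text.replace(/\w+/g, (word) => {
    // Look up the lowercased word; returning the original word leaves it unchanged.
    const key = word.toLowerCase();
    return synonyms.has(key) ? synonyms.get(key) : word;
  });
}

console.log(replaceWithSynonyms('whose territory comprises the western portion'));
// whose land comprises the western part
```

Note that this sketch does not preserve capitalization: a capitalized word that has a synonym is replaced by the lowercase entry from the table.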
