
I'm trying to extract all terms from the text. When a term appears inside a sentence and contains `()`, the text is not split and the regex doesn't find the term.

I'm trying to properly split matches that contain `()`. So instead of this:

["What is API(Application Programming Interface) and how to use it?"]

I'm trying to get this:

["What is", "API(Application Programming Interface)", "and how to use it?"]

The JSON term is extracted properly, and I get exactly what I want:

["JSON", "is a Javascript Object Notation"]

For the API term, however, I expect this:

["What is", "API(Application Programming Interface)", "and how to use it?"]

but I get this instead:

["What is API(Application Programming Interface) and how to use it?"]

function getAllTextNodes(element) {
    let node;
    let nodes = [];
    let walk = document.createTreeWalker(element,NodeFilter.SHOW_TEXT,null,false);
    while (node = walk.nextNode()) nodes.push(node);
    return nodes;
  }

const allNodes = getAllTextNodes(document.getElementById("body"))

const terms = [
    {id: 1, definition: 'API stands for Application programming Interface', expression: 'API(Application Programming Interface)'},
    {id: 2, definition: 'JSON stands for JavaScript Object Notation.', expression: 'JSON'}
]

const termMap = new Map(
      [...terms].sort((a, b) => b.expression.length - a.expression.length)
                .map(term => [term.expression.toLowerCase(), term])
    );

const regex = RegExp("\\b(" + Array.from(termMap.keys()).join("|") + ")\\b", "ig");

for (const node of allNodes) {
    const pieces = node.textContent.split(regex).filter(Boolean);
    console.log(pieces)
}
<div id="body">
    <p>API(Application Programming Interface)</p>
    <p>What is API(Application Programming Interface) and how to use it?</p>
    <p>JSON is a Javascript Object Notation</p>
</div>
Ryszard Czech
Denis Omerovic
  • And the problem/question is? And what have you tried so far to solve this on your own? -> [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) – Andreas Jan 24 '22 at 12:48
  • [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask): _"Write a title that **summarizes the specific problem**"_ – Andreas Jan 24 '22 at 12:49
  • @Andreas sorry about that. So I created regex to match all the terms inside `#body` and properly split each node into array. So the only problem I have is how to properly split sentence when term is containing `()` – Denis Omerovic Jan 24 '22 at 12:52
  • Escape the terms in your regex. And if you can have special chars at the start/end of the string, you can't use `\b` word boundaries. – Wiktor Stribiżew Jan 24 '22 at 12:53

1 Answer

Since your terms can contain non-word characters, you cannot rely on `\b` word boundaries. I would suggest switching to either unambiguous word boundaries (`(?<!\w)` / `(?!\w)`) or adaptive dynamic word boundaries.
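For comparison, here is a minimal sketch of the unambiguous variant, which simply requires that no word character sits directly before or after the match. The names `pattern`, `unambiguous` and `sample` are made up for illustration, and the terms are written already escaped to keep it short:

```javascript
// Terms are hard-coded and pre-escaped here; this is an illustration only.
const pattern = "api\\(application programming interface\\)|json";

// (?<!\w) : no word char immediately before the match
// (?!\w)  : no word char immediately after the match
const unambiguous = new RegExp("(?<!\\w)(" + pattern + ")(?!\\w)", "ig");

const sample = "What is API(Application Programming Interface) and how to use it?";
const samplePieces = sample.split(unambiguous).filter(Boolean);
console.log(samplePieces);
// ["What is ", "API(Application Programming Interface)", " and how to use it?"]

// "JSONP" is left alone because a word char follows "JSON" there.
const jsonPieces = "JSONP is not JSON".split(unambiguous).filter(Boolean);
console.log(jsonPieces);
// ["JSONP is not ", "JSON"]
```

Note that the pieces keep their surrounding spaces, so trim them if you need the exact arrays from the question.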

Besides, you need to escape your terms before using them in the regex.
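To see why escaping matters, here is a small sketch. `escapeRegExp` is a hypothetical helper name (it is not a built-in here); its character class is the same one used in the full example in this answer:

```javascript
// Hypothetical helper: prefixes every regex metacharacter with a backslash.
function escapeRegExp(text) {
  return text.replace(/[-\/\\^$*+?.()|[\]{}]/g, "\\$&");
}

const term = "api(application programming interface)";
const sentence = "What is API(Application Programming Interface) and how to use it?";

// Unescaped, the parentheses form a capturing group, so the pattern
// looks for "apiapplication programming interface" with no literal "(".
const unescaped = new RegExp(term, "i");

// Escaped, the parentheses are matched literally.
const escaped = new RegExp(escapeRegExp(term), "i");

console.log(unescaped.test(sentence)); // false
console.log(escaped.test(sentence));   // true
```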

See below an example with adaptive word boundaries:

// Collect every text node under the given element.
function getAllTextNodes(element) {
    const nodes = [];
    const walk = document.createTreeWalker(element, NodeFilter.SHOW_TEXT);
    let node;
    while ((node = walk.nextNode())) nodes.push(node);
    return nodes;
}

const allNodes = getAllTextNodes(document.getElementById("body"))

const terms = [
    {id: 1, definition: 'API stands for Application programming Interface', expression: 'API(Application Programming Interface)'},
    {id: 2, definition: 'JSON stands for JavaScript Object Notation.', expression: 'JSON'}
]

const termMap = new Map(
      [...terms].sort((a, b) => b.expression.length - a.expression.length)
                .map(term => [term.expression.toLowerCase(), term])
    );

const regex = RegExp("(?!\\B\\w)(" + Array.from(termMap.keys()).map(x => x.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&')).join("|") + ")(?<!\\w\\B)", "ig");

for (const node of allNodes) {
    const pieces = node.textContent.split(regex).filter(Boolean);
    console.log(pieces)
}
<div id="body">
    <p>API(Application Programming Interface)</p>
    <p>What is API(Application Programming Interface) and how to use it?</p>
    <p>JSON is a Javascript Object Notation</p>
</div>

The regex is now `(?!\B\w)(api\(application programming interface\)|json)(?<!\w\B)`, where

  • `(?!\B\w)` - left-hand adaptive word boundary (with no context-checking if the following char is a non-word char)
  • `(api\(application programming interface\)|json)` - Group 1 matching one of your terms (see escape special chars)
  • `(?<!\w\B)` - right-hand adaptive word boundary (with no context-checking if the preceding char is a non-word char)
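As a quick check outside the DOM, the sketch below runs the plain `\b` version and the adaptive version on the same strings. The variable names are made up for illustration, and the terms are hard-coded in their escaped form:

```javascript
// Terms are hard-coded and pre-escaped; this is an illustration only.
const escapedTerms = "api\\(application programming interface\\)|json";

const plain = new RegExp("\\b(" + escapedTerms + ")\\b", "ig");
const adaptive = new RegExp("(?!\\B\\w)(" + escapedTerms + ")(?<!\\w\\B)", "ig");

const text = "What is API(Application Programming Interface) and how to use it?";

// With \b, the trailing boundary fails: ")" and " " are both non-word chars.
const plainPieces = text.split(plain).filter(Boolean);
console.log(plainPieces);
// ["What is API(Application Programming Interface) and how to use it?"]

// The adaptive boundary skips the check next to a non-word char such as ")".
const adaptivePieces = text.split(adaptive).filter(Boolean);
console.log(adaptivePieces);
// ["What is ", "API(Application Programming Interface)", " and how to use it?"]

// Next to a word char it still behaves like \b, so "JSONP" is not split.
const partialPieces = "JSONP is different".split(adaptive).filter(Boolean);
console.log(partialPieces);
// ["JSONP is different"]
```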
Wiktor Stribiżew