2

I have a string that looks something like this:

<strong>word</strong>: or <em>word</em> or <p><strong>word</strong>: this is a sentence</p> etc...

I am trying to parse each string into an array of words, without the HTML elements.
For example, the string:

<strong>word</strong>:

should end up being an array that looks like this:

['word', ':']

The string:

<p><strong>word</strong>: this is a sentence</p>

should end up being an array that looks like this:

['word', ':', 'this', 'is', 'a', 'sentence']      

Is there any way to do this in JavaScript? My code below creates an array of individual characters rather than words separated by spaces.

//w = the string I want to parse
var p = document.querySelector("p").innerText;

var result = p.split(' ').map(function(w) {
  if (w === '')
    return w;
  else {
    var tempDivElement = document.createElement("div");
    tempDivElement.innerHTML = w;

    const wordArr = Array.from(tempDivElement.textContent);
    return wordArr;
  }
});
console.log(result)
<p><strong>word</strong>: this is a sentence</p>
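A note on why the snippet above yields single characters: `Array.from()` applied to a string iterates it character by character, so each word is broken into letters. A minimal sketch of the difference, using a plain string in place of the page content (the `text` value is an assumption standing in for what `innerText` returns):

```javascript
// Assumed sample: the visible text of the <p> element in the question.
const text = 'word: this is a sentence';

// Array.from on a string yields one entry per character.
const chars = Array.from('word:');
console.log(chars); // [ 'w', 'o', 'r', 'd', ':' ]

// Splitting on whitespace keeps each word intact, but the colon stays glued to "word".
const pieces = text.split(/\s+/);
console.log(pieces); // [ 'word:', 'this', 'is', 'a', 'sentence' ]
```

Separating the colon from the word therefore needs something beyond a whitespace split, such as a regular expression.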
mplungjan
adhoc

6 Answers

3

I would make the temp div first and extract the inner text. Then use match() to find words (note \w matches letters, numbers and underscore). This will treat the punctuation like : as separate words, which seems to be what you want.

var p = '<strong>word</strong>: or <em>word</em> or <p><strong>word</strong>: this is a sentence</p>'

var tempDivElement = document.createElement("div");
tempDivElement.innerHTML = p;

let t = tempDivElement.innerText
let words = t.match(/\w+|\S/g)
console.log(words)

If you just want the words, match only on \w:

var p = '<strong>word</strong>: or <em>word</em> or <p><strong>word</strong>: this is a sentence</p>'

var tempDivElement = document.createElement("div");
tempDivElement.innerHTML = p;

let t = tempDivElement.innerText
let words = t.match(/\w+/g)
console.log(words)
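The same regular expressions can be tried on a plain string, without the DOM step. The sketch below wraps them in a hypothetical helper `tokenize` (not part of the code above); it falls back to an empty array because `match()` with the `g` flag returns `null` when nothing matches. The Unicode variant is an assumption about what you might want for non-ASCII text, since `\w` only covers `[A-Za-z0-9_]`.

```javascript
// Hypothetical helper: split text into word tokens and single punctuation marks.
function tokenize(text) {
  // match() returns null when there is no match, so fall back to [].
  return text.match(/\w+|\S/g) || [];
}

// Variant using Unicode property escapes (ES2018) so accented letters stay in one token.
function tokenizeUnicode(text) {
  return text.match(/[\p{L}\p{N}_]+|\S/gu) || [];
}

console.log(tokenize('word: this is a sentence'));
// [ 'word', ':', 'this', 'is', 'a', 'sentence' ]
console.log(tokenize('   ')); // []
console.log(tokenize('caf\u00e9')); // [ 'caf', 'é' ]
console.log(tokenizeUnicode('caf\u00e9')); // [ 'café' ]
```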
Mark
  • Yeah . . . your first solution looks like the best approach so far on the page. If he wants to group non-word characters, a small change to the regex that you use in the `match()` would do that: `t.match(/\w+|\S+/g)` – talemyn May 17 '19 at 18:25
  • Thank you, this is definitely what I was looking for. This is the best answer because it accounts for the : and other non-word characters and gives each one its own index in the array. – adhoc May 17 '19 at 22:15
0

You can do that by creating a temporary HTML element and then reading its textContent.

Example:

/* Get words separated by spaces only */
function myFunction1(htmlString) {
  var div = document.createElement('div');
  div.innerHTML = htmlString;
  return (div.textContent || div.innerText).split(" ");
}

/* Get words separated by spaces as well as by HTML tags */
function myFunction2(htmlString) {
  var div = document.createElement('div');
  div.innerHTML = htmlString;
  var children = div.querySelectorAll('*');
  for (var i = 0; i < children.length; i++) {
    // Append the space as a text node. Assigning to textContent would
    // replace the element's children and detach any nested tags.
    children[i].appendChild(document.createTextNode(' '));
  }
  return (div.textContent || div.innerText).split(" ");
}

console.log('function 1 result:');
console.log(myFunction1("<strong>word</strong>: or <em>word</em> or <p><strong>word</strong>: this is a sentence</p>etc..."));
console.log('function 2 result: ');
console.log(myFunction2("<strong>word</strong>: or <em>word</em> or <p><strong>word</strong>: this is a sentence</p>etc..."));
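One thing to watch with `split(" ")`: consecutive spaces, which `myFunction2` tends to create, produce empty strings in the result. A minimal sketch of a whitespace-tolerant split, using a hypothetical helper name `splitWords` and a hand-written sample string rather than real DOM output:

```javascript
// Hypothetical helper: split on any run of whitespace and drop empty entries.
function splitWords(text) {
  return text.split(/\s+/).filter(Boolean);
}

// Hand-written sample with a double space and a trailing space.
const sample = 'word : or  word  or word : this is a sentence ';
console.log(sample.split(' ').includes('')); // true: empty strings sneak in
console.log(splitWords(sample));
// [ 'word', ':', 'or', 'word', 'or', 'word', ':', 'this', 'is', 'a', 'sentence' ]
```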
saurabh
  • Wow! This is so much better than my solution! I did not even know about the `textContent` attribute. – abalter May 17 '19 at 17:38
0

One possible way is to use the built-in DOMParser:

var string = '<strong>word</strong>: or <em>word</em> or <p><strong>word</strong>: this is a sentence</p> etc...';
var doc = new DOMParser().parseFromString(string, 'text/html');

You would then need to recursively descend into the doc HTMLDocument object through its childNodes.

Similarly, you could use a client-side JavaScript web scraper such as artoo.js and examine the nodes that way.

Strings that are NOT inside a tag, such as ": or", do not need special handling: the parser places them in doc.body as text nodes, so iterating childNodes (rather than children) reaches them too.
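The recursive descent described above could look roughly like the sketch below. `collectWords` is a hypothetical helper, not part of any library; it relies only on `nodeType`, `nodeValue` and `childNodes`, and the `DOMParser` call is guarded so the snippet does nothing outside a browser.

```javascript
// Hypothetical helper: walk a node tree depth-first and collect whitespace-separated words.
function collectWords(node, words = []) {
  const TEXT_NODE = 3; // same value as Node.TEXT_NODE
  if (node.nodeType === TEXT_NODE) {
    // Split the text on runs of whitespace and skip empty pieces.
    words.push(...node.nodeValue.split(/\s+/).filter(Boolean));
  } else {
    // Elements and documents: descend into every child, text nodes included.
    Array.from(node.childNodes || []).forEach((child) => collectWords(child, words));
  }
  return words;
}

// Browser-only usage; DOMParser is not available in plain Node.js.
if (typeof DOMParser !== 'undefined') {
  const html = '<strong>word</strong>: or <em>word</em> or <p><strong>word</strong>: this is a sentence</p> etc...';
  const doc = new DOMParser().parseFromString(html, 'text/html');
  console.log(collectWords(doc.body));
}
```

Returning the array instead of pushing into a shared variable keeps the function reusable across several strings.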

abalter
0

Based on this answer: https://stackoverflow.com/a/2579869/1921385 you can recursively iterate over each node and add the text parts to an array. For example:

var items = [];
var elem = document.querySelector("div");
function getText(node) {
    // recurse into each child node
    if (node.hasChildNodes()) {
        node.childNodes.forEach(getText);
    } else if (node.nodeType === Node.TEXT_NODE) {
        const text = node.textContent.trim();
        if (text) {
            var words = text.split(" ");
            words.forEach(function(word) {
              items.push(word);
            });
        }
    }
}
//
getText(elem);
console.log(items);
<div><strong>word</strong>: or <em>word</em> or <p><strong>word</strong>: this is a sentence</p></div>
Moob
0

The colon after the "word" value is the tricky part, but using the textContent property and some string manipulation, you can build a string that can be split() into the array that you are looking for.

First, collect the element to be parsed:

var p = document.querySelector("p");

Next, get the text content from inside of it using the "textContent" property:

var pContent = p.textContent;

Next, "massage" the content to make sure that any "non-word" characters are separated from the words, without being lost (the space on either end handles non-word characters before and after the words):

var result = pContent.replace(/(\W+)/g, " $1 ");

Next, trim any leading or trailing spaces, to avoid empty elements at the beginning and end of the array:

result = result.trim();

Then finally, split the updated string by blocks of whitespace:

result = result.split(/\s+/);

What makes this even better, though, is that you can actually do all that manipulation in one line of code, if you want to, as seen in the condensed solution below:

var element1 = document.querySelector("#element1");
var element2 = document.querySelector("#element2");
var element3 = document.querySelector("#element3");

function elementTextToArray(element) {
  return element.textContent.replace(/(\W+)/g, " $1 ").trim().split(/\s+/);
}

console.log(elementTextToArray(element1));
console.log(elementTextToArray(element2));
console.log(elementTextToArray(element3));
<p id="element1"><strong>word</strong></p>
<p id="element2"><strong>word</strong>: this is a sentence</p>
<p id="element3"><strong>word</strong>: this is a sentence <em>with multiple levels of <strong>depth</strong> in it!!!</em></p>

UPDATE #1: Made the "non-word" check both greedy (captures all non-word characters) and able to capture groups of non-word characters (like "!!!").
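A caveat when adapting this: in JavaScript's `String.prototype.replace`, the whole match is written `$&` and the first capture group `$1`; `$0` is not a recognised pattern and is inserted literally. A quick check on a plain string (the helper name `textToArray` is hypothetical and takes text rather than an element):

```javascript
// Hypothetical helper mirroring the condensed solution, but on a plain string.
function textToArray(text) {
  // "$1" re-inserts the captured run of non-word characters, padded with spaces.
  return text.replace(/(\W+)/g, ' $1 ').trim().split(/\s+/);
}

console.log(textToArray('word: this is a sentence'));
// [ 'word', ':', 'this', 'is', 'a', 'sentence' ]
console.log(textToArray('depth in it!!!'));
// [ 'depth', 'in', 'it', '!!!' ]

// "$0" has no special meaning in JavaScript, so it appears verbatim in the output.
console.log('word: this'.replace(/(\W+)/g, ' $0 ')); // "word $0 this"
```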

talemyn
0
  1. To make this work in the snippet, the target HTML is wrapped in a <div>.
  2. Extract the text with .textContent.
  3. Clean it up with .replace(), passing the regex /(\s+|\n)/g, which replaces any run of adjacent whitespace or newline characters with a single space. The result is then trimmed at both ends with .trim().
  4. Then .split() the string at every space.

let text = document.querySelector('.content').textContent;
let clean = text.replace(/(\s+|\n)/g, ' ').trim();
let array = clean.split(' ');
console.log(array);
<div class='content'>
  <strong>word</strong>: or <em>word</em> or
  <p><strong>word</strong>: this is a sentence</p> etc...
</div>
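Because `\s` matches newlines and tabs as well as spaces, a single `\s+` is enough to collapse all of them. A small sketch on a plain string standing in for `.textContent` (the sample text and the helper name `normalizeAndSplit` are assumptions for illustration):

```javascript
// Hypothetical helper: collapse whitespace, trim, then split on single spaces.
function normalizeAndSplit(text) {
  const clean = text.replace(/\s+/g, ' ').trim();
  // Guard the empty case: ''.split(' ') would otherwise return [''].
  return clean === '' ? [] : clean.split(' ');
}

// Assumed sample with the kind of indentation and line breaks textContent preserves.
const raw = '\n  word: or word or\n  word: this is a sentence etc...\n';
console.log(normalizeAndSplit(raw));
// [ 'word:', 'or', 'word', 'or', 'word:', 'this', 'is', 'a', 'sentence', 'etc...' ]
console.log(normalizeAndSplit(' \n\t ')); // []
```

Note that this approach leaves the colon attached to the preceding word.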
zer00ne