Split innerHTML into text for translation JSON in JavaScript

Question

I am currently working on an application that needs to extract the innerHTML of the body and pull the text out of it into a JSON object. That JSON will be sent for translation, and the translated JSON will then be used as input to recreate the same HTML markup with the translated text. Please see the snippets below.

HTML Input

<section>Hello, <div>This is some text which I need to extract.<a class="link">It can be <strong> complicated.</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag</span><p>Please see the <span>desired output below.</span></p>Thanks!</section>

Translation JSON Output

{
"text1":"Hello, ",
"text2":"This is some text which I need to extract.",
"text3":"It can be <strong> complicated.</strong>",
"text4":"The extracted text should contain the html tag if it has any html tag in the span,p or a tag",
"text5":"Please see the <span>desired output below.</span>",
"text6":"Thanks!"
}

Translated JSON Input

{
"text1":"Hello,-in spanish ",
"text2":"This is some text which I need to extract.-in spanish",
"text3":"It can be <strong> complicated.-in spanish</strong>",
"text4":"The extracted text should contain the html tag if it has any html tag in the span,p or a tag-in spanish",
"text5":"Please see the <span>desired output below.-in spanish</span>",
"text6":"Thanks!-in spanish"
}

Translated HTML Output

<section>Hello,-in spanish <div>This is some text which I need to extract.-in spanish<a class="link">It can be <strong> complicated.-in spanish</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag-in spanish</span><p>Please see the <span>desired output below.-in spanish</span></p>Thanks!-in spanish</section>

I tried various regexes. Below is one of the flows I ended up with, but I am not able to achieve the desired output with it.

//encode
const bodyHTML = '<a class="test">hello world<strong> this is gonna be hard</strong></a>';
//replace the quotes with escape quotes
const htmlContent = bodyHTML.replace(/"/g, '\\"');
let count = 0;
let translationObj = {};
let newHtml = htmlContent.replace(/\>(.*?)\</g, function(match) {
  //remove the special character 
  match = match.replace(/\>|\</g, '');
  count = count + 1;
  translationObj[count] = match;

  return '>~' + count + '~<';
});

const translationJSON = '{"1":"hello world in spanish","2":" this is gonna be hard in spanish","3":""}';

//decode
let translatedHtml = '';
const translatedObj = JSON.parse(translationJSON);
translatedHtml = newHtml.replace(/\~(.*?)\~/g, function(match) {
  //remove the special character 
  match = match.replace(/\~|\~/g, '');

  return translatedObj[match];
});
//replace the escape quotes with quotes
translatedHtml = translatedHtml.replace(/\\"/g, '"');
//log the intermediate and final results
console.log("bodyHTML", bodyHTML);
console.log("translationObj", translationObj);
console.log("translationJSON", translationJSON);
console.log("newHtml", newHtml);
console.log("translatedHtml", translatedHtml);

I am looking for a working regex or any other approach in the JS world that would produce the desired result. I want to get all the text inside the HTML in the form of JSON. Another condition is not to split the text when it contains inner HTML tags, so that we don't lose the context of the sentence. For example, <p>Click <a>here</a></p> should be treated as one text, "Click <a>here</a>". I hope that clarifies everything.

Thanks in advance !

You can extract text on the client with something like jQuery( "body:contains(Text)" ).text(). You could refine this if your extractable elements had a specific CSS class. — developer, May 23 '18 at 16:24
Uh oh. Someone is parsing HTML with a regex. Seriously, though, maybe look for something like JSoup for JS. Unless I'm misunderstanding this. — , May 23 '18 at 16:26
How do you tell if an HTML tag is an inner tag or not? In your example, you said you wanted `<div>This is some[...] to extract.<a>It can be <strong>complicated.</strong></a></div>` to become `"This is some text which I need to extract."/"It can be <strong>complicated.</strong>"`. But then you say you want `<p>Click <a>here</a></p>` to become `"Click <a>here</a>"`. — Ivan, May 23 '18 at 16:26
[Inevitable link](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454). — T.J. Crowder, May 23 '18 at 16:30
Is your starting point really a string with HTML in it, or is your starting point a document? Are you doing this in a browser? (You've tagged [tag:jquery], so it seems likely, but...) — T.J. Crowder, May 23 '18 at 16:31
@T.J.Crowder my starting point will be a string with HTML in it and I will remove the jQuery tag. Thanks! — dk111989, May 23 '18 at 16:36
What environment are you doing this in, then? Node.js? The JVM? A Windows Universal App? — T.J. Crowder, May 23 '18 at 16:36
I will be using Node.js to create something like a microservice for translation @T.J.Crowder — dk111989, May 23 '18 at 16:39
Thank you so much @T.J.Crowder for guiding me through this process. I definitely learned something new today, but this does the same thing my snippet was doing. My goal is to keep `It can be <strong> complicated.</strong>` together. — dk111989, May 23 '18 at 23:42
@dk111989 - Because you don't want to translate "it can be" and "complicated" separately? That's going to be (no pun) complicated. :-) How would you know what part of the translated text goes in the `strong` element? — T.J. Crowder, May 24 '18 at 07:03
@T.J.Crowder Yes, I don't want to lose the context of the sentence. What I am thinking of doing now is parsing the HTML and checking whether the most commonly used text tags (a, p, span, h, li, etc.) have any child tags. If they do, I keep them as is with their HTML markup and then split the inner text on periods so I get separate sentences. My goal is not to make it work 100%, but to provide as much coverage as possible, because this is meant to help someone with their translation process, not to be used as a business tool. — dk111989, May 24 '18 at 15:48

T.J. Crowder · Answer 1 · 2018-05-23T16:55:08.070

By far, the best way to do this is by using an HTML parser, then looping through the text nodes in the tree. You cannot correctly handle a non-regular markup language like HTML with just simple JavaScript regular expressions¹ (many have wasted a lot of time trying), and that's not even taking into account all of HTML's specific peculiarities.
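As a small illustration of the problem (a sketch with a made-up input string, not one from the question): the `>(.*?)<` pattern used in the question's attempt treats every `>` as the end of a tag, so an attribute value that happens to contain `>` is captured as if it were text.

```javascript
// Hypothetical input: an attribute value that contains ">".
const html = '<a title="1 > 0">link</a>';

// Same "everything between > and <" idea as the pattern in the question.
const captured = [...html.matchAll(/>(.*?)</g)].map((m) => m[1]);

// The only capture is ' 0">link': part of the attribute leaked into the "text".
console.log(captured);
```

A parser knows that the first `>` sits inside a quoted attribute value; the regex has no way to tell.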

There are at least a couple, probably several, well-tested, actively-supported DOM parser modules available on npm.

So the basic structure would be:

1. Parse the HTML into a DOM.
2. Walk the DOM in a defined order (typically depth-first traversal) building up your object or array of text strings to translate from the text nodes you encounter.
3. Convert that object/array to JSON if necessary, send it off for translation, get the result back, parse it from JSON into an object/array again if necessary.
4. Walk the DOM in the same order, applying the results from the object/array.
5. Serialize the DOM to HTML.
6. Send the result.
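The JSON round trip in the third step could look roughly like this if you want the `text1`, `text2`, ... keys shown in the question. This is only a sketch: the helper names `toTranslationJson` and `fromTranslationJson` are made up here, and the two sample strings are just entries from the question's input.

```javascript
// Hypothetical helper: turn the ordered strings into {"text1": ..., "text2": ...} JSON.
function toTranslationJson(strings) {
  const obj = {};
  strings.forEach((str, i) => {
    obj["text" + (i + 1)] = str;
  });
  return JSON.stringify(obj);
}

// Hypothetical helper: read the translated JSON back into an array in the same order.
function fromTranslationJson(json, count) {
  const obj = JSON.parse(json);
  const result = [];
  for (let i = 1; i <= count; i++) {
    const key = "text" + i;
    if (!Object.prototype.hasOwnProperty.call(obj, key)) {
      // Fail loudly rather than silently dropping text from the page.
      throw new Error("Missing translation for " + key);
    }
    result.push(obj[key]);
  }
  return result;
}

const sample = ["Hello, ", "Thanks!"];
const json = toTranslationJson(sample);
const roundTripped = fromTranslationJson(json, sample.length);
console.log(json, roundTripped);
```

Keeping the array order and the key numbering in lockstep is what lets the second DOM walk put each translated string back into the node it came from.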

Here's an example — naturally here I'm using the HTML parser built into the browser rather than an npm module, and the API to whatever module you're using may be slightly different, but the concept is the same:

var html = '<section>Hello, <div>This is some text which I need to extract.<a class="link">It can be <strong> complicated.</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag</span><p>Please see the <span>desired output below.</span></p>Thanks!</section>';
var dom = parseHTML(html);
var strings = [];
walk(dom, function(node) {
  if (node.nodeType === 3) { // text node
    strings.push(node.nodeValue);
  }
});
console.log("strings = ", strings);
var translation = translate(strings);
console.log("translation = ", translation);
var n = 0;
walk(dom, function(node) {
  if (node.nodeType === 3) { // text node
    node.nodeValue = translation[n++];
  }
});
var newHTML = serialize(dom);
document.getElementById("before").innerHTML = html;
document.getElementById("after").innerHTML = newHTML;


function translate(strings) {
  return strings.map(str => str.toUpperCase());
}

function walk(node, callback) {
  var child;
  callback(node);
  switch (node.nodeType) {
    case 1: // Element
      for (child = node.firstChild; child; child = child.nextSibling) {
        walk(child, callback);
      }
  }
}

// Placeholder for module function
function parseHTML(html) {
  var div = document.createElement("div");
  div.innerHTML = html;
  return div;
}

// Placeholder for module function
function serialize(dom) {
  return dom.innerHTML;
}

<strong>Before:</strong>
<div id="before"></div>
<strong>After:</strong>
<div id="after"></div>

¹ Some "regex" libs (or regex features of other languages) are really regex+more features that can help you do something similar, but they're not just regex, and JavaScript's built-in ones don't have those features.

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

If anyone is looking to do something like this, I created a translation service here:

https://github.com/gurusewak/translation

My goal was not to achieve 100% success rate in breaking the sentences but to get as many sentences as possible. I was just trying to help someone to do translation when given some html as an input. Hopefully, this might help someone in some way in future.
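For readers who want the general idea without opening the repository, here is a minimal sketch of the grouping heuristic discussed in the comments. It is not taken from the linked code. It assumes a set of "text" tags (`UNIT_TAGS`, chosen here for illustration) whose whole inner HTML is kept as one translation unit, and it uses tiny stand-in node objects (`el`, `text`) instead of a real parser's nodes. Attributes and HTML escaping are left out to keep it short.

```javascript
// Assumption: tags whose entire inner HTML is kept together as one unit.
const UNIT_TAGS = new Set(["a", "p", "span", "li", "h1", "h2", "h3", "h4", "h5", "h6"]);

// Stand-in node constructors; a real parser module would supply its own node types.
const el = (tag, ...children) => ({ tag, children });
const text = (value) => ({ value });

// Serialize a stand-in node back to HTML (no attributes, no escaping).
function toHtml(node) {
  if (node.tag === undefined) return node.value;
  return "<" + node.tag + ">" + node.children.map(toHtml).join("") + "</" + node.tag + ">";
}

// Collect translation units in document order.
function collectUnits(node, units = []) {
  if (node.tag === undefined) {
    units.push(node.value); // bare text node: its own unit
  } else if (UNIT_TAGS.has(node.tag)) {
    units.push(node.children.map(toHtml).join("")); // keep inner markup together
  } else {
    node.children.forEach((child) => collectUnits(child, units)); // container: recurse
  }
  return units;
}

const tree = el("section",
  text("Hello, "),
  el("div", text("Some text."), el("a", text("It can be "), el("strong", text("complicated.")))),
  el("p", text("Click "), el("a", text("here"))),
  text("Thanks!")
);

const units = collectUnits(tree);
console.log(units);
```

With this rule, `<p>Click <a>here</a></p>` yields the single unit `Click <a>here</a>`, while text directly inside `section` or `div` stays separate. Writing the translations back would mean re-parsing each translated unit as HTML, since it may contain tags.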

Cheers !
