I'm trying to extract text from HTML along with its properties (bold, underlined, italic, superscript, etc.), but I am struggling with nested ones. For example, in <b> Lorem <u> Ipsum </u></b>, Lorem should be bold and Ipsum should be both bold and underlined.

Example Data

<p> Normal<b>Bold</b> <b>Bold<u>Underlined</u></b> <b><i>Bold Italic</i></b><p/>

I need to use this text in an InDesign script and assign character styles for these properties. Is there any tool or technique for PHP or JavaScript that I can use?

  • What is the desired output of the script? Will the script run from within InDesign? If so, are you sure PHP can be run that way? – Petr 'PePa' Pavel Feb 11 '22 at 16:34
  • Hi, I'm not sure what you are looking for. Does the [Node.nodeName](https://developer.mozilla.org/fr/docs/Web/API/Node/nodeName) JavaScript property correspond to what you want? – Baptiste Rieber Feb 11 '22 at 16:37
  • Does this help? https://stackoverflow.com/questions/494143/creating-a-new-dom-element-from-an-html-string-using-built-in-dom-methods-or-pro It describes how to get a DOM tree structure of objects that represent the tags and texts parsed from an HTML string. – Petr 'PePa' Pavel Feb 11 '22 at 16:38
  • @Petr'PePa'Pavel I will get the HTML data from a PHP file, so I can generate a JavaScript file with PHP and execute it. Either JavaScript or PHP works for me. The desired output can be an object or array containing the words or parts together with their properties, but they must stay in their original order. I tried to write a function that returns the parts with their properties. – Mango Kafa Feb 11 '22 at 16:42

1 Answer

Check whether DOMParser is available in the environment where you're going to run your JS code.

The code below parses the HTML string and outputs a tree structure of the nodes and their texts.

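const htmlString = '<p> Normal<b>Bold</b> <b>Bold<u>Underlined</u></b> <b><i>Bold Italic</i></b><p/>';

// Parse the string into a document and take the first node inside <body> (the <p>).
const htmlElement = new DOMParser().parseFromString(htmlString, 'text/html').body.firstChild;

const tree = convertDomToArray(htmlElement);

console.log(tree);


// Recursively converts a DOM node into nested objects keyed by node name.
// Text nodes are returned as plain strings.
function convertDomToArray(element) {
  if (element.nodeName === '#text') {
    return element.nodeValue;
  }

  // Convert every child, preserving document order.
  const children = [];
  for (const childElement of element.childNodes) {
    children.push(convertDomToArray(childElement));
  }

  // Example shape: { B: ['Bold', { U: ['Underlined'] }] }
  const output = {};
  output[element.nodeName] = children;

  return output;
}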
  • Thanks for your answer. The DOM returns node names as keys and elements as values, and they are nested, so I can't tell whether an item has both the u and b properties. Is there any way I can get an output like `Array ( 'Normal', 'B' : 'Bold', 'B' : 'Bold', 'BI' : 'Underlined'..., 'PROPERTIES' : 'TEXT' )` – Mango Kafa Feb 12 '22 at 08:27
  • I understand you want to ignore all HTML tags that aren't B, I or U and then for texts, show their parent tags combined. This can certainly be done with some more programming. At this point, I'm reluctant to continue for free because I feel this is outside the scope of StackOverflow. I understand SO is here to educate, not for free programming. Please contact me at petr.pavel@pepa.info if you're okay with paying for this job. – Petr 'PePa' Pavel Feb 12 '22 at 11:42
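
The flattening described in the last comment (ignore every tag that is not a style tag, and label each text with the combination of its parent tags) could be sketched as a recursive walk that carries the currently active styles down the tree. This is only a minimal sketch and not the answerer's code: `collectStyledRuns`, `STYLE_TAGS`, and the `{ text, styles }` output shape are hypothetical names chosen for illustration, and the list of style tags is an assumption. The function only reads `nodeName`, `nodeValue`, and `childNodes`, so the sample tree here is hand-built from plain objects and the snippet runs without DOMParser.

```javascript
// Tags treated as character styles (an assumption); every other tag is ignored.
const STYLE_TAGS = ['B', 'I', 'U', 'SUP', 'SUB'];

// Walks a DOM-like node depth-first and returns a flat, ordered list of
// { text, styles } runs, where styles lists the style tags enclosing the text.
function collectStyledRuns(node, activeStyles = [], runs = []) {
  if (node.nodeName === '#text') {
    if (node.nodeValue !== '') {
      runs.push({ text: node.nodeValue, styles: activeStyles.slice() });
    }
    return runs;
  }

  // Add this tag to the active styles only if it is a style tag
  // that is not already active (avoids duplicates for <b><b>..</b></b>).
  const name = String(node.nodeName).toUpperCase();
  const isNewStyle = STYLE_TAGS.includes(name) && !activeStyles.includes(name);
  const childStyles = isNewStyle ? activeStyles.concat(name) : activeStyles;

  // An index loop works for both arrays and a real NodeList.
  const children = node.childNodes || [];
  for (let i = 0; i < children.length; i++) {
    collectStyledRuns(children[i], childStyles, runs);
  }
  return runs;
}

// Hand-built stand-in for the DOM of: <p>Normal<b>Bold<u>Underlined</u></b></p>
const text = (value) => ({ nodeName: '#text', nodeValue: value });
const el = (name, ...children) => ({ nodeName: name, childNodes: children });

const sampleTree = el('P', text('Normal'), el('B', text('Bold'), el('U', text('Underlined'))));
const runs = collectStyledRuns(sampleTree);
console.log(runs);
```

Where DOMParser is available, the element it produces can presumably be passed in place of `sampleTree`. Each run keeps its position in document order, and joining `styles` (for example `['B', 'U']` into `'BU'`) gives a key that could be mapped to a character style name. The sketch uses modern JavaScript syntax, so an older ExtendScript engine may need it rewritten with `var` and plain functions.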