Regex Issue for Title Case on String Containing HTML Markup

Question

Currently I'm running the following replacement approach ...

const str = '<span style="font-weight:bold;color:Blue;">ch</span>edilpakkam,tiruvallur';
const rex = (/(\b[a-z])/g);
 
const result = str.toLowerCase().replace(rex, function (letter) {
  //console.log(letter.toUpperCase())
  return letter.toUpperCase();
});

console.log(result);


... with a source of ...

<span style="font-weight:bold;color:Blue;">ch</span>edilpakkam,tiruvallur

... and the following result ...

<Span Style="Font-Weight:Bold;Color:Blue;">Ch</Span>Edilpakkam,Tiruvallur

But what I want to achieve are the following points ...

Keep the span bound to the string, i.e. leave the markup itself untouched.
Uppercase the first letter of the string and of every following word.

Expected output ...

<span style="font-weight:bold;color:Blue;">Ch</span>edilpakkam,Tiruvallur

[Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) HTML and regex are not good friends. Use a parser, it is simpler, faster and much more maintainable. — Toto, Mar 09 '21 at 08:35
Please post the full relevant code, it is not clear how come you get `ChEdilpakkam, Tiruvallur` and not `ChEdilpakkam,Tiruvallur` (that I get after running your code). — Wiktor Stribiżew, Mar 09 '21 at 11:04
@ArunvairavanV ... are there any questions left regarding all the given answers? — Peter Seliger, Mar 11 '21 at 10:53

Peter Seliger · Answer 1 · 2021-03-09T18:50:41.190

Toto already commented on the difficulties of "parsing" HTML code via regex.

The following generic (markup-agnostic) approach uses a sandbox-like div element in order to benefit from its DOM parsing/accessing capabilities.

First, one needs to collect all text-nodes of the temporary sandbox. Then, for each text-node's textContent, one has to decide whether to start with capitalizing all words from a string's beginning or not.

The cases for capitalizing every word within a string including the first occurring one are ...

The text-node's previous sibling either does not exist ...
... or is a block-level element.
The text-node itself starts with a whitespace(-sequence).

In all other cases one still wants to capitalize the first character of every word after a word boundary ... except for the word at the very beginning of the text-node's content, since it continues a word that started in the preceding inline element.
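As a quick illustration that needs no DOM, the two patterns used by `getNodeSpecificWordCapitalizingRegex` below can be tried on plain strings. This is only a sketch: the helper `capitalize` and the constant names are made up for the illustration, and the sample strings merely stand in for text-node contents.

```javascript
// Pattern A: capitalize the first character of every word, including the first word.
const everyWord = /\b(\w)/g;
// Pattern B: same, but skip a word that starts at the very beginning of the string
// (no `m` flag, so `^` only matches at index 0).
const everyWordButFirst = /(?<!^)\b(\w)/g;

// hypothetical helper, only used for this illustration.
const capitalize = (text, regex) =>
  text.replace(regex, (match, letter) => letter.toUpperCase());

// text inside the span (no previous sibling): pattern A applies.
console.log(capitalize('ch', everyWord)); // "Ch"

// text that starts with whitespace: pattern A applies as well.
console.log(capitalize(' edilpakkam,tiruvallur', everyWord)); // " Edilpakkam,Tiruvallur"

// text that directly continues the word begun in the span: pattern B applies.
console.log(capitalize('edilpakkam,tiruvallur', everyWordButFirst)); // "edilpakkam,Tiruvallur"
```

Note that the lookbehind in pattern B requires an engine that supports lookbehind assertions.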

function collectContentTextNodesRecursively(list, node) {
  return list.concat(
    (node.nodeType === 1) // element-node?

    ? Array
      .from(node.childNodes)
      .reduce(collectContentTextNodesRecursively, [])

    : (node.nodeType === 3) // text-node?
      ? node
      : []
  );
}

function getNodeSpecificWordCapitalizingRegex(textNode) {
  const prevNode = textNode.previousSibling;
  // no `g` flag needed: the regex is only used for a single `test` call.
  const isAssumeBlockBefore = (prevNode === null) || (/^(?:address|article|aside|blockquote|details|dialog|dd|div|dl|dt|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|header|hgroup|hr|li|main|nav|ol|p|pre|section|table|ul)$/).test(prevNode.nodeName.toLowerCase());

  // either assume a previous block element, or the current text starts with whitespace.
  return (isAssumeBlockBefore || (/^\s+/).test(textNode.textContent))

    // capture every first word character after word boundary.
    ? (/\b(\w)/g)
    // capture every first word character after word boundary except at beginning of line.
    : (/(?<!^)\b(\w)/g);
}


function capitalizeEachTextContentWordWithinCode(code) {
  const sandbox = document.createElement('div');
  sandbox.innerHTML = code;

  collectContentTextNodesRecursively([], sandbox).forEach(textNode => {

    textNode.textContent = textNode.textContent.replace(
      getNodeSpecificWordCapitalizingRegex(textNode),
      (match, capture) => capture.toUpperCase()
    ); 
  });
  return sandbox.innerHTML; 
}


const htmlCode = [
  '<span style="font-weight:bold;color:blue;">ch</span>edilpakkam,tiruvallur, chedilpakkam,tiruvallur',
  '<span style="font-weight:bold;color:blue;">ch</span> edilpakkam,tiruvallur, chedilpakkam,tiruvallur',
  '<span style="font-weight:bold;color:blue;">ch</span> edilpakkam, tiruvallur,chedilpakkam, tiruvallur',
  '<span style="font-weight:bold;color:blue;">ch</span>edilpakkam, tiruvallur,chedilpakkam, tiruvallur',
].join('<br/>');

document.body.innerHTML = capitalizeEachTextContentWordWithinCode(htmlCode);

console.log(document.body.innerHTML.split('<br>'));


Jobelle · Answer 2 · 2021-03-10T10:50:18.857


Try the following:


function formatText(str) {
  var res = str.replace(/(\b[a-z])/gi, function (match, $1) {
    return $1.toUpperCase();
  }).replace(/^([a-z]{2})(.*)/gim, "<span style='font-weight:bold;color:Blue;'>$1</span>$2");
  return res;
}

edited Mar 10 '21 at 10:50

answered Mar 09 '21 at 12:27


You quoted it correctly. But the OP wants to achieve this task by somehow *"parsing"* a string of HTML code like `'chedilpakkam,tiruvallur'`. Your approach takes a string like `'chedilpakkam, tiruvallur'`, performs the uppercasing and then, in a non-generic step, slices exactly the first 2 letters, wraps them in predefined HTML code and appends the rest of the string. Thus this approach fails if the OP wants to process code like `'chedilpakkam,tiruvallur'` (`che` instead of `ch` and style changes). – Peter Seliger Mar 10 '21 at 11:17
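To make the limitation described in this comment concrete, here is a small check that runs `formatText` once on a plain string and once on the markup string from the question. The function body is copied from the answer above; the variable names `plain` and `markup` are only chosen for this sketch.

```javascript
// Same logic as formatText from the answer above.
function formatText(str) {
  return str
    // uppercase every letter that follows a word boundary ...
    .replace(/(\b[a-z])/gi, (match, $1) => $1.toUpperCase())
    // ... then wrap the first two letters of each line in a hard-coded span.
    .replace(/^([a-z]{2})(.*)/gim, "<span style='font-weight:bold;color:Blue;'>$1</span>$2");
}

// plain text input: yields the desired markup.
const plain = formatText('chedilpakkam,tiruvallur');
console.log(plain);
// <span style='font-weight:bold;color:Blue;'>Ch</span>edilpakkam,Tiruvallur

// markup input as in the question: tag and attribute names get capitalized too.
const markup = formatText('<span style="font-weight:bold;color:Blue;">ch</span>edilpakkam,tiruvallur');
console.log(markup);
// <Span Style="Font-Weight:Bold;Color:Blue;">Ch</Span>Edilpakkam,Tiruvallur
```

The second call reproduces the unwanted result from the question, because the first replacement does not distinguish between text content and markup.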
