
I was pointed to this post, which does not seem to meet my criteria: Replace a Regex capture group with uppercase in Javascript

I am trying to make a regex that will:

  • format a string by upper-casing the first letter of each word and lower-casing the remaining characters
  • ignore HTML markup
  • accept Swedish characters (åäöÅÄÖ)

Say I've got this string:

<b>app</b>le store östersund

Then I want it to become this (the changed characters are the uppercase ones):

<b>App</b>le Store Östersund

I've been playing around with it and the closest I've got is the following:

(?!([^<])*?>)[åäöÅÄÖ]|\s\b\w

which resulted in:

<b>app</b>le Store Östersund

Or this

/(?!([^<])*?>)[åäöÅÄÖ]|\S\b\w/g

which resulted in:

<B>App</B>Le store Östersund
Thom A
faerin

2 Answers


It is not possible to do this with a regexp alone, since a regexp doesn't understand HTML structure. [*] Instead, we need to process each text node and carry our notion of where a word begins from one node to the next, in case a word continues across text nodes. A character is at the start of a word if it is preceded by whitespace, or if it is at the start of a text node and that node is either the first one or follows a text node that ended in whitespace.

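function htmlToTitlecase(html, letters) {
  // Parse the markup in a detached element so only text nodes are touched.
  let div = document.createElement('div');
  // A letter preceded by whitespace, or sitting at the start of a text node.
  let re = new RegExp("(^|\\s)([" + letters + "])", "gi");
  div.innerHTML = html;
  let treeWalker = document.createTreeWalker(div, NodeFilter.SHOW_TEXT);
  // True while the next text node would begin a new word.
  let startOfWord = true;
  while (treeWalker.nextNode()) {
    let node = treeWalker.currentNode;
    node.data = node.data.replace(re, function(match, space, letter) {
      if (space || startOfWord) {
        return space + letter.toUpperCase();
      } else {
        // Start of a node that continues a word from the previous node.
        return match;
      }
    });
    // Remember whether this node ended in whitespace.
    startOfWord = /\s$/.test(node.data);
  }
  return div.innerHTML;
}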

console.log(htmlToTitlecase("<b>app</b>le store östersund", "a-zåäö"));
// <b>App</b>le Store Östersund
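
The function above needs a browser DOM (`document`, `TreeWalker`). Outside a browser, the same carry-over idea can be sketched by splitting the string on tags. This is only a rough sketch, not a replacement: `htmlToTitlecaseNoDom` is a made-up name, and it assumes well-formed markup with no `>` inside attribute values, no comments and no script or style content, which is exactly the kind of thing a real parser handles for you.

```javascript
// Hypothetical DOM-free variant of htmlToTitlecase (name and approach are assumptions).
// Assumes every tag looks like <...> with no ">" inside attribute values.
function htmlToTitlecaseNoDom(html, letters) {
  const re = new RegExp("(^|\\s)([" + letters + "])", "gi");
  let startOfWord = true;
  return html
    .split(/(<[^>]*>)/) // the capture group keeps the tags, at the odd indexes
    .map(function (part, i) {
      if (i % 2 === 1 || part === "") return part; // tag or empty gap: leave as is
      const out = part.replace(re, function (match, space, letter) {
        return (space || startOfWord) ? space + letter.toUpperCase() : match;
      });
      startOfWord = /\s$/.test(part); // carry the word boundary into the next text run
      return out;
    })
    .join("");
}

console.log(htmlToTitlecaseNoDom("<b>app</b>le store östersund", "a-zåäö"));
// <b>App</b>le Store Östersund
```

Empty gaps between adjacent tags are skipped so that they do not reset `startOfWord`, mirroring the fact that the DOM version never sees empty text nodes.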

[*] It may be possible, but even if so, it would be horribly ugly, since it would need to cover an awful number of corner cases. It might also need a stronger RegExp engine than JavaScript's, like Ruby's or Perl's.

EDIT:

Even if just specifying really simple html tags? The only ones I am actually in need of covering is <b> and </b> at the moment.

This was not specified in the question. The solution is general enough to work for any markup (including simple tags). But...

function simpleHtmlToTitlecaseSwedish(html) {
  return html.replace(/(^|\s)(<\/?b>|)([a-zåäö])/gi, function(match, space, tag, letter) {
    return space + tag + letter.toUpperCase();
  });
}
console.log(simpleHtmlToTitlecaseSwedish("<b>app</b>le store östersund"));
// <b>App</b>le Store Östersund
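
If listing the letters by hand is a concern, engines that support ES2018 Unicode property escapes can match any letter with `\p{L}` and the `u` flag. A minimal sketch under that assumption (the function name is made up, and it is still limited to `<b>` and `</b>`):

```javascript
// Hypothetical variant: \p{L} (any Unicode letter) replaces the hand-written a-zåäö class.
// Requires an engine with ES2018 Unicode property escapes.
function simpleHtmlToTitlecaseUnicode(html) {
  return html.replace(/(^|\s)(<\/?b>|)(\p{L})/gu, function (match, space, tag, letter) {
    return space + tag + letter.toUpperCase();
  });
}

console.log(simpleHtmlToTitlecaseUnicode("<b>app</b>le store östersund"));
// <b>App</b>le Store Östersund
```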
Amadan
  • Even if just specifying really simple html tags? The only ones I am actually in need of covering is <b> and </b> at the moment. – faerin Aug 09 '17 at 09:04
  • Oh hold on, just saw your edited answer. I will check it out. Thanks for your effort. – faerin Aug 09 '17 at 09:17

I have a solution which uses almost nothing but regex. It may not be the most intuitive way to do it, but it should be effective, and I find it fun :)

You have to append to the end of your string every lowercase character followed by its uppercase counterpart, like this (for my regex, it must also be preceded by a space):
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ
(I don't know which letters are missing; I know nothing about the Swedish alphabet, sorry. I'm counting on you to correct that!)

Then you can use the following regex:
(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$
Replace with:
$1$3

Here is working JavaScript code:

// Initialization
var regex = /(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$/g;
var string = "test <b when=\"2>1\">ap<i>p</i></b>le store östersund";

// Processing
var result = string + " aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ";
result = result.replace(regex, "$1$3");

// Display result
console.log(result);
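
To see what the lookahead `(?=.*\2(.)\S*$)` does in isolation: it searches the appended table for the letter just matched and captures the character that follows it, which is the uppercase form. A small sketch of only that step (the helper name `upperViaTable` and the shortened table are invented for illustration):

```javascript
// Hypothetical helper isolating the lookup step: find `letter` in an "aAbB..." table
// and return the character stored right after it.
function upperViaTable(letter, table) {
  const m = (letter + " " + table).match(/^(.)(?=.*\1(.)\S*$)/);
  return m ? m[2] : letter; // no table entry: keep the character unchanged
}

var table = "aAbBcCåÅäÄöÖ"; // shortened table, for illustration only
console.log(upperViaTable("b", table)); // B
console.log(upperViaTable("ö", table)); // Ö
console.log(upperViaTable("B", table)); // c
```

The last call shows a limit of the trick: an uppercase letter is also found in the table, and the character after it is the next lowercase letter. As far as I can tell the same applies to the `[\wåäö]` class in the regex above, since `\w` also matches A-Z, so the approach is best fed words that start in lowercase.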

Edit: I forgot to handle the first word of the string; that's fixed now :)

Gawil
  • It should handle any html tag as long as they are not incomplete. – Gawil Aug 09 '17 at 09:11
  • Wow, that's almost perfect! It fails, however, on, say, `apple Store Östersund`. The first a should've been capitalized too :) For the record, you nailed the Swedish alphabet. – faerin Aug 09 '17 at 09:13
  • @entiendoNull Yeah I just saw that, it should be ok now :) – Gawil Aug 09 '17 at 09:14
  • Perfect! Thanks for your effort! :) – faerin Aug 09 '17 at 09:15
  • @entiendoNull Anytime! I had fun writing this regex :) Just for the record, if you need to add letters to the alphabet, add them to the appended string and to both sets of brackets: `[\wåäö]` (lowercase only) and `[\wåÅäÄöÖ]` (both lowercase and uppercase) – Gawil Aug 09 '17 at 09:18
  • "It should handle any html tag as long as they are not incomplete": `apple store östersund`? `apple`? HTML is way too tricky to do by regexp, even clever regexp. – Amadan Aug 09 '17 at 09:19
  • @Amadan Thanks for the comment, I changed the regex to handle those cases :) – Gawil Aug 09 '17 at 09:25