Uppercase for each new word swedish characters and html markup

Question

I was pointed to this post, which does not seem to cover my criteria: Replace a Regex capture group with uppercase in Javascript

I am trying to make a regex that will:

format a string by adding uppercase for the first letter of each word and lower case for the rest of the characters
ignore HTML markup
accept Swedish characters (åäöÅÄÖ)

Say I've got this string:

<b>app</b>le store östersund

Then I want it to be (changes marked by uppercase characters)

<b>App</b>le Store Östersund

I've been playing around with it and the closest I've got is the following:

(?!([^<])*?>)[åäöÅÄÖ]|\s\b\w

Resulted in

<b>app</b>le Store Östersund

Or this

/(?!([^<])*?>)[åäöÅÄÖ]|\S\b\w/g

Resulted in

<B>App</B>Le store Östersund

Amadan · Accepted Answer · 2017-08-09T09:20:54.613

It is not possible to do this with regexp alone, since regexp doesn't understand HTML structure. [*] Instead, we need to process each text node, and carry our logic for what counts as the beginning of a word from one text node to the next, in case a word continues across different text nodes. A character is at the start of a word if it is preceded by whitespace, or if it is at the start of a text node and that node is either the first text node or the previous text node ended in whitespace.

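function htmlToTitlecase(html, letters) {
  // Parse the markup into a detached element so only text nodes are edited.
  let div = document.createElement('div');
  // Match a letter at the start of the node or right after whitespace.
  let re = new RegExp("(^|\\s)([" + letters + "])", "gi");
  div.innerHTML = html;
  let treeWalker = document.createTreeWalker(div, NodeFilter.SHOW_TEXT);
  // True while the next text node would begin a new word.
  let startOfWord = true;
  while (treeWalker.nextNode()) {
    let node = treeWalker.currentNode;
    node.data = node.data.replace(re, function(match, space, letter) {
      // An empty `space` means the match is at the start of the node, which
      // only starts a word if the previous node ended in whitespace.
      if (space || startOfWord) {
        return space + letter.toUpperCase();
      } else {
        return match;
      }
    });
    startOfWord = /\s$/.test(node.data);
  }
  return div.innerHTML;
}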

console.log(htmlToTitlecase("<b>app</b>le store östersund", "a-zåäö"));
// <b>App</b>le Store Östersund
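For environments without `document` (Node.js, for instance), the same start-of-word bookkeeping can be sketched without a DOM. This is only an illustration, not part of the answer above: `titlecaseOutsideTags` is a made-up name, and it assumes every tag is well formed and contains no `>` inside attribute values, which real HTML does not guarantee.

```javascript
// Hypothetical DOM-free sketch of the same idea (not the answer's code).
// Assumption: tags are well formed and never contain ">" in attribute values.
function titlecaseOutsideTags(html, letters) {
  const startRe = new RegExp("(^|\\s)([" + letters + "])", "gi");
  let startOfWord = true; // true while the previous text piece ended in whitespace
  // Splitting on a capturing group keeps the tags; they land at odd indices.
  return html.split(/(<[^>]*>)/).map(function (part, i) {
    if (i % 2 === 1 || part === "") {
      return part; // tag or empty text piece: copy through, keep the state
    }
    const out = part.replace(startRe, function (match, space, letter) {
      return (space || startOfWord) ? space + letter.toUpperCase() : match;
    });
    startOfWord = /\s$/.test(part);
    return out;
  }).join("");
}

console.log(titlecaseOutsideTags("<b>app</b>le store östersund", "a-zåäö"));
// <b>App</b>le Store Östersund
```

The TreeWalker version remains the safer choice whenever a real HTML parser is available, since it does not rely on a regexp to find the tags.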

[*] Maybe possible, but even if so, it would be horribly ugly, since it would need to cover an awful number of corner cases. It also might need a stronger RegExp engine than JavaScript's, like Ruby's or Perl's.

EDIT:

Even if just specifying really simple html tags? The only ones I am actually in need of covering is <b> and </b> at the moment.

This was not specified in the question. The solution is general enough to work for any markup (including simple tags). But...

function simpleHtmlToTitlecaseSwedish(html) {
  return html.replace(/(^|\s)(<\/?b>|)([a-zåäö])/gi, function(match, space, tag, letter) {
    return space + tag + letter.toUpperCase();
  });
}
console.log(simpleHtmlToTitlecaseSwedish("<b>app</b>le store östersund"));
// <b>App</b>le Store Östersund
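The question also asks for the rest of each word to be lowercased, which the function above leaves alone. A possible way to add that, still assuming the only markup is literal `<b>` and `</b>` tags, is to lowercase everything outside those tags first. The name `simpleHtmlToTitlecaseLowerRest` is made up for this sketch.

```javascript
// Hypothetical variant (not from the answer): also lowercases the rest of each word.
// Assumption: the only markup in the input is literal <b> and </b> tags.
function simpleHtmlToTitlecaseLowerRest(html) {
  // Splitting on a capturing group keeps the tags at the odd indices,
  // so only the even-indexed text pieces are lowercased.
  const lowered = html.split(/(<\/?b>)/i).map(function (part, i) {
    return i % 2 === 1 ? part : part.toLowerCase();
  }).join("");
  // Same replacement as in simpleHtmlToTitlecaseSwedish.
  return lowered.replace(/(^|\s)(<\/?b>|)([a-zåäö])/gi, function (match, space, tag, letter) {
    return space + tag + letter.toUpperCase();
  });
}

console.log(simpleHtmlToTitlecaseLowerRest("<b>aPP</b>LE sTORE ÖSTERSUND"));
// <b>App</b>le Store Östersund
```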

Even if just specifying really simple html tags? The only ones I am actually in need of covering is <b> and </b> at the moment. — faerin, Aug 09 '17 at 09:04
Oh hold on, just saw your edited answer. I will check it out. Thanks for your effort. — faerin, Aug 09 '17 at 09:17

Gawil · Answer 2 · 2017-08-09T09:34:13.267


I have a solution which uses almost only regex. It may not be the most intuitive way to do it, but it should be effective and I find it fun :)

You have to append to the end of your string every lowercase character followed by its uppercase counterpart, like this (the appended part must also be preceded by a space for my regex):
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ
(I don't know which letters are missing, I know nothing about the Swedish alphabet, sorry... I'm counting on you to correct that!)

Then you can use the following regex:
(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$
Replace by:
$1$3


Here is working JavaScript code:

// Initialization
var regex = /(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$/g;
var string = "test <b when=\"2>1\">ap<i>p</i></b>le store östersund";

// Processing
var result = string + " aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ";
result = result.replace(regex, "$1$3");

// Display result
console.log(result);
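To see the lookup-table trick in isolation, here is a stripped-down sketch with no HTML handling and a toy alphabet of only a, b and c. The function name and the shortened table are invented for illustration. The lookahead finds the captured lowercase letter inside the appended table and captures the character right after it, which is its uppercase form; the second alternative then deletes the table, because unmatched groups in `$1$3` expand to empty strings.

```javascript
// Hypothetical, simplified illustration of the appended-table trick.
// Assumptions: plain text only (no tags), no newlines, alphabet limited to a-c.
function capitalizeWithLookupTable(text) {
  const table = "aAbBcC"; // each lowercase letter followed by its uppercase form
  // Alternative 1: a lowercase letter at a word start; the lookahead locates it
  //                in the final whitespace-free run (the table) and captures
  //                the following character as group 3.
  // Alternative 2: the appended table itself, including its leading space.
  const re = /(^|\s)([a-c])(?=.*\2(.)\S*$)|\s[a-cA-C]+$/g;
  return (text + " " + table).replace(re, "$1$3");
}

console.log(capitalizeWithLookupTable("abc cab"));
// Abc Cab
```

Capturing only lowercase letters in group 2 matters here: an uppercase letter would be found in the table followed by the next lowercase letter, not by its own counterpart.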

Edit: I forgot to handle the first word of the string; it's corrected now :)

edited Aug 09 '17 at 09:34

answered Aug 09 '17 at 09:07

It should handle any html tag as long as they are not incomplete. – Gawil Aug 09 '17 at 09:11
Wow, that's almost perfect! It fails however if say `apple Store Östersund`. The first a should've been capitalized too :) For the record, you nailed the swedish alphabet. – faerin Aug 09 '17 at 09:13
@entiendoNull Yeah I just saw that, it should be ok now :) – Gawil Aug 09 '17 at 09:14
Perfect! Thanks for your effort! :) – faerin Aug 09 '17 at 09:15
@entiendoNull Anytime ! I had fun writing this regex :) Just for the record, if you need to add letters to the alphabet, add them to the appended string, and to both sets of brackets `[\wåäö]` (lowercase only) and `[\wåÅäÄöÖ]` (both lowercase and uppercase) – Gawil Aug 09 '17 at 09:18
"It should handle any html tag as long as they are not incomplete": `apple store östersund`? `apple`? HTML is way too tricky to do by regexp, even clever regexp. – Amadan Aug 09 '17 at 09:19
@Amadan Thanks for the comment, I changed the regex to handle those cases :) – Gawil Aug 09 '17 at 09:25
