1

How do I convert a string to upper case with String.toUpperCase() while ignoring HTML entities such as &nbsp; and all the others? The problem is that &nbsp; becomes &NBSP;, and the browser no longer recognizes it as an HTML entity.

I came up with this but it does not cover all special characters:

public static String toUpperCaseIgnoreHtmlSymbols(String str) {
    if (str == null) return "";
    str = str.trim();
    // Swap named entities for numeric ones, which contain no letters to uppercase
    str = str.replaceAll("(?i)&nbsp;", "&#160;");
    str = str.replaceAll("&quot;", "&#34;");
    str = str.replaceAll("&amp;", "&#38;");
    // etc.
    str = str.toUpperCase();
    return str;
}
Arnout Engelen
Vad

3 Answers

3

Are you only interested in skipping HTML entities, or do you also want to skip tags? What about chunks of JavaScript? URLs in links?

If you need to support that kind of stuff, you won't be able to avoid using a 'real' HTML parser instead of a regex. For example, parse the document using jsoup, manipulate the resulting Document, and convert it back to HTML:

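import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

private String upperCase(String str) {
    Document document = Jsoup.parse(str);
    upperCase(document.body());
    return document.html();
}

// Uppercase the text nodes only; tags and attributes are left untouched
private void upperCase(Node node) {
    if (node instanceof TextNode) {
        TextNode textnode = (TextNode) node;
        textnode.text(textnode.text().toUpperCase());
    }
    for (Node child : node.childNodes()) {
        upperCase(child);
    }
}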

Now:

upperCase("This is some <a href=\"http://arnout.engelen.eu\">text&nbsp;with&nbsp;entities</a>");

will produce:

<html>
  <head></head>
  <body>
    THIS IS SOME 
    <a href="http://arnout.engelen.eu">TEXT&nbsp;WITH&nbsp;ENTITIES</a>
  </body>
</html>
Arnout Engelen
  • Can I do the same without Jsoup? – Vad Aug 23 '12 at 16:09
  • Well, yes, but you'd need to include or write some other HTML parser. JSoup is lightweight, high quality, well-tested and released under a permissive license. Doing something like this correctly yourself is nontrivial. Not sure what more you could ask for :). – Arnout Engelen Aug 23 '12 at 16:12
0

You could split the string into groups with this regex:

(.+?)(&[^ ]+?;)

The first part matches text before the special character, the second part matches the special character.

Once you have done that, convert only the first group to uppercase, and repeat for every match in the string.
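
A minimal sketch of that idea, using only java.util.regex. The class and method names are made up for illustration, and the pattern is an assumption of mine: it is tighter than the one above, so that only things shaped like `&name;`, `&#160;` or `&#xA0;` are skipped. Text after the last entity is uppercased as well. Like the regex above, it also leaves undefined names such as `&fooooo;` untouched.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityAwareUpperCase {
    // Assumed pattern: named, decimal, or hexadecimal character references.
    private static final Pattern ENTITY =
            Pattern.compile("&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);");

    public static String toUpperCaseKeepEntities(String str) {
        if (str == null) return "";
        Matcher m = ENTITY.matcher(str);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Uppercase the plain text before the entity...
            out.append(str.substring(last, m.start()).toUpperCase(Locale.ROOT));
            // ...and copy the entity itself unchanged.
            out.append(m.group());
            last = m.end();
        }
        // Uppercase whatever follows the last entity.
        out.append(str.substring(last).toUpperCase(Locale.ROOT));
        return out.toString();
    }

    public static void main(String[] args) {
        // prints TEXT&nbsp;WITH &amp; ENTITIES
        System.out.println(toUpperCaseKeepEntities("text&nbsp;with &amp; entities"));
    }
}
```

Locale.ROOT is used so the result does not depend on the default locale (for example the Turkish dotted I).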

Gabber
  • This is simpler than my solution, but technically fails the corner-case where someone has `&fooooo;` in their HTML. Since that's not a defined entity, it ought to be interpreted as plain text (and thus uppercased by Vad's code). – Nathaniel Waisbrot Aug 23 '12 at 15:08
0

I think you have the right idea, replacing all named entities with their numeric equivalents.

Here's the W3C's list of entities for HTML4: http://www.w3.org/TR/html4/sgml/entities.html

You could format that into a single two-column table without too much work. (Note that there are three tables at that link.) I'd do that, then read the table in, and you can easily convert from named to numeric and back.
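
A rough sketch of the table-driven approach, assuming the table has already been loaded into a map. Only five entries are hard-coded here for illustration; a real version would read the full list. The class and method names are hypothetical, and the conversion back from numeric to named is left out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedToNumeric {
    // Illustrative subset of the HTML 4 entity table (name -> code point).
    private static final Map<String, Integer> CODES = new HashMap<>();
    static {
        CODES.put("nbsp", 160);
        CODES.put("quot", 34);
        CODES.put("amp", 38);
        CODES.put("lt", 60);
        CODES.put("gt", 62);
    }

    private static final Pattern NAMED = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    public static String toUpperCaseViaNumeric(String str) {
        if (str == null) return "";
        Matcher m = NAMED.matcher(str);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(str, last, m.start());
            Integer code = CODES.get(m.group(1));
            // Known names become decimal references; unknown names stay plain text.
            out.append(code != null ? "&#" + code + ";" : m.group());
            last = m.end();
        }
        out.append(str.substring(last));
        // Decimal references contain no letters, so uppercasing cannot damage them.
        return out.toString().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prints TEXT&#160;WITH &#34;Q&#34; &FOOOOO;
        System.out.println(toUpperCaseViaNumeric("text&nbsp;with &quot;q&quot; &fooooo;"));
    }
}
```

Because the lookup only hits defined names, an undefined one such as `&fooooo;` is treated as ordinary text and uppercased.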

Nathaniel Waisbrot