In one of my projects I need to remove all unusual characters, that is, everything other than a-z, A-Z and 0-9. I found a way of doing this, but I think my solution is ugly and inefficient.

How can I improve my solution to make it more efficient?

public static String removeSpecialCharactersAndHTML(String text) {
    String result = text;

    // Decode a handful of HTML entities.
    result = result.replace("&gt;", ">");
    result = result.replace("&lt;", "<");
    result = result.replace("&#38;", "&");
    result = result.replace("&quot;", "\"");
    result = result.replace("&nbsp;", " ");
    result = result.replace("&amp;", "&");

    // Drop the CDATA terminator and map typographic quotes to ASCII quotes.
    result = result.replace("]]>", "");
    result = result.replace("‘", "'");
    result = result.replace("’", "'");
    result = result.replace("`", "'");
    result = result.replace("´", "'");
    result = result.replace("“", "\"");

    // .....

    result = result.replace("”", "\"");
    // Map superscript digits to plain digits.
    result = result.replace("³", "3");
    result = result.replace("²", "2");

    return result;
}
Sophie
  • There are more characters that are special than non special. What characters *can* it handle? – Bohemian Sep 21 '16 at 17:36
  • Paste your code in the question in text form with proper formatting. Do not provide links to external sources such as github. – progyammer Sep 21 '16 at 17:37
  • I think [that](http://stackoverflow.com/a/10574318/1402861) may answer your question ; ) – WrRaThY Sep 21 '16 at 17:40
  • Basically any characters that were around in the 90's (a-z, A-Z, numbers and some basic special characters, like !@#$%^&*()). I probably should keep track of a list of allowed characters, and replace any characters not in that list. – Sophie Sep 21 '16 at 17:41

2 Answers

To remove HTML from a string, you should not write your own code but use an existing library instead. A library avoids the many bugs that are in your code.

The approach of replacing certain characters is fine. But at the end, you must remove all characters from the string that will not be handled by the terminal. That is, rather than defining the forbidden characters, define the allowed characters.
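
As a rough illustration of that two-step idea, the sketch below first applies a small replacement table and then drops every character that is not on an allow-list. The class name `TerminalTextSanitizer`, the contents of the table, and the choice of printable ASCII as the allowed set are all assumptions made for this example; the real set depends on what the terminal can display.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TerminalTextSanitizer {

    // Hypothetical replacement table; extend it with the substitutions the project needs.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("\u2018", "'");   // left single quotation mark
        REPLACEMENTS.put("\u2019", "'");   // right single quotation mark
        REPLACEMENTS.put("\u201C", "\"");  // left double quotation mark
        REPLACEMENTS.put("\u201D", "\"");  // right double quotation mark
        REPLACEMENTS.put("\u00B2", "2");   // superscript two
        REPLACEMENTS.put("\u00B3", "3");   // superscript three
    }

    // Assumed allow-list: printable ASCII, from space through tilde.
    private static boolean isAllowed(char c) {
        return c >= ' ' && c <= '~';
    }

    static String sanitize(String text) {
        // Step 1: replace known characters with an ASCII equivalent.
        String result = text;
        for (Map.Entry<String, String> entry : REPLACEMENTS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }

        // Step 2: keep only characters on the allow-list and drop everything else.
        StringBuilder out = new StringBuilder(result.length());
        for (int i = 0; i < result.length(); i++) {
            char c = result.charAt(i);
            if (isAllowed(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this structure, a character that nobody thought of in advance is simply dropped instead of slipping through, and new substitutions only require another entry in the table.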

Roland Illig

You can use the approach below if you need to remove whitespace as well:

result = result.replaceAll("[^a-zA-Z0-9]", "");

If you want to keep whitespace in your string, you may use this approach:

result = result.replaceAll("[^a-zA-Z0-9\\s]", "");
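
Since the question asks about efficiency, one possible refinement is to compile the two patterns once and reuse them, because `String.replaceAll` compiles its regular expression on every call. The following is only a sketch; the class name `AlphanumericFilter` and its method names are made up for this example.

```java
import java.util.regex.Pattern;

final class AlphanumericFilter {

    // Compiled once and reused; the patterns are the same as in the two snippets above.
    private static final Pattern NOT_ALNUM = Pattern.compile("[^a-zA-Z0-9]");
    private static final Pattern NOT_ALNUM_OR_SPACE = Pattern.compile("[^a-zA-Z0-9\\s]");

    // Removes everything except ASCII letters and digits, including whitespace.
    static String stripAll(String text) {
        return NOT_ALNUM.matcher(text).replaceAll("");
    }

    // Removes everything except ASCII letters, digits and whitespace.
    static String keepWhitespace(String text) {
        return NOT_ALNUM_OR_SPACE.matcher(text).replaceAll("");
    }
}
```

Note that `\\s` matches tabs and line breaks as well as spaces, so `keepWhitespace` leaves those in the result too.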

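Using an existing third-party library is also recommended. For example, Apache Commons Text provides StringEscapeUtils. Note that the linked escapeHtml4 encodes characters as entities, while unescapeHtml4 in the same class decodes entities, which is the direction the question needs: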

https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html#escapeHtml4-java.lang.String-

Thiluxan