How to sanitize String from \n \t etc.?

Question

This is my example String:

"Hello\n I am\t \n \n Marco\t\n"

I want to remove all whitespace characters. Is there a generic solution that works for more than just \n and \t?

Yes, but with a regex I need to know all the characters, and I'm not sure what the full set is — pawel033, Nov 05 '20 at 14:30
@pawel033 - Do you have any problem with `\s+` as mentioned in [this answer](https://stackoverflow.com/a/64699473/10819573)? — Arvind Kumar Avinash, Nov 06 '20 at 08:22
@ArvindKumarAvinash Yes, I have: https://paste.pics/ALVMM . I think it might be related to the input, but I don't know what the cause is — pawel033, Nov 06 '20 at 09:37

CryptoFool · Answer 1 · 2020-11-05T14:57:02.613

This replaces runs of characters that are not word characters with a single space. You don't have to know what characters you don't want. You just say which ones you do want:

class Test {
    public static void main(String[] args) {

        String data = "Hello\n I am\t \n \n Marco\t\n";

        data = data.replaceAll("[^\\w]+", " ");

        System.out.println(data);
    }
}

Result:

Hello I am Marco

The regular expression "[^\\w]+" says to match groups of characters that are not word characters. Word characters are A-Z, a-z, 0-9 and "_". The call to replaceAll says to replace each of these groups of characters with a single space character.

You have other options, if this isn't exactly what you wanted, by tweaking the regular expression and the replacement string. You could, for example, leave the spaces in with the expression "[^\\w ]+", and change the replacement string to "", but then you'll have multiple spaces between some of your words.

You can add other characters to the list of characters that are not removed by adding them to the "[^\\w]+" expression.
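
One caveat is easier to see with a concrete input: `[^\\w]+` (or `\\W+`) treats everything that is not an ASCII letter, digit or underscore as removable, so punctuation disappears along with the whitespace. The following is a minimal sketch comparing it with `\\s+`; the class name `WordVsWhitespace`, the method names and the sample string are made up for illustration.

```java
class WordVsWhitespace {

    // Replaces every run of non-word characters with one space (the approach above).
    static String keepWordChars(String s) {
        return s.replaceAll("[^\\w]+", " ");
    }

    // Replaces only runs of whitespace with one space; punctuation survives.
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String data = "Hello,\n I am\t Marco!\n";
        // Brackets make the trailing space visible.
        System.out.println("[" + keepWordChars(data) + "]");      // [Hello I am Marco ]
        System.out.println("[" + collapseWhitespace(data) + "]"); // [Hello, I am Marco! ]
    }
}
```

If the trailing space is unwanted, calling `trim()` on the result is one way to drop it.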

You can simplify the regex pattern to `\\W+` (`\\W` == `[^\\w]`) — nkrivenko, Nov 05 '20 at 14:40
You could improve your answer by explaining what you are doing and why. For example what '\w' means, why you are replacing every match with " ", etc. — lugiorgi, Nov 05 '20 at 14:41

score 0 · Answer 2 · answered Nov 05 '20 at 14:42

Simply replace all whitespace (i.e. \s+) with "".

public class Main {
    public static void main(String[] args) {
        String str = "Hello\n I am\t \n \n Marco\t\n";
        str = str.replaceAll("\\s+", "");
        System.out.println(str);
    }
}

Output:

HelloIamMarco
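
If `\\s+` seems to miss some characters, one possible cause is that Java's `\s` matches only ASCII whitespace (`[ \t\n\x0B\f\r]`) unless the `UNICODE_CHARACTER_CLASS` flag is enabled. Here is a minimal sketch, assuming the input contains a no-break space (U+00A0) and an em space (U+2003); the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class UnicodeWhitespace {

    // Default \s: ASCII whitespace only.
    private static final Pattern ASCII_WS = Pattern.compile("\\s+");

    // With UNICODE_CHARACTER_CLASS, \s follows the Unicode White_Space property.
    private static final Pattern UNICODE_WS =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    static String removeAsciiWhitespace(String s) {
        return ASCII_WS.matcher(s).replaceAll("");
    }

    static String removeUnicodeWhitespace(String s) {
        return UNICODE_WS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String data = "Hello\n\u00a0I am\t\u2003Marco\n";
        // The two Unicode spaces survive the default pattern, so 15 chars remain.
        System.out.println(removeAsciiWhitespace(data).length()); // 15
        System.out.println(removeUnicodeWhitespace(data));        // HelloIamMarco
    }
}
```

The same flag can be written inline as `(?U)`, for example `str.replaceAll("(?U)\\s+", "")`.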

score 0 · Answer 3 · answered Nov 05 '20 at 14:53

You can also use Java streams, which I consider more readable:

String noWhitespace = "Hello\n I am\t \n \n Marco\t\n".chars()
                            .filter(c -> !Character.isWhitespace(c))
                            .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                            .toString();
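
Note that `Character.isWhitespace` deliberately returns false for non-breaking spaces such as U+00A0, while `Character.isSpaceChar` returns true for them. The following is a sketch of the same stream idea that checks both and iterates over code points rather than `char` values; the class name `StreamStrip` and its methods are hypothetical.

```java
class StreamStrip {

    // True for anything Java treats as whitespace or as a Unicode space separator.
    static boolean isAnySpace(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    // Drops every code point for which isAnySpace is true.
    static String removeSpaces(String s) {
        return s.codePoints()
                .filter(cp -> !isAnySpace(cp))
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(removeSpaces("Hello\n\u00a0I am\t \n Marco\t\n")); // HelloIamMarco
    }
}
```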

score 0 · Answer 4 · answered Nov 05 '20 at 15:13

I haven't had great luck with using regular expressions to handle whitespace in Java (I disagree with Java on the definition of whitespace, and it gets weird when you start dealing with Unicode characters). For fine-grained control, I use the following:

public static String strip(final String text)
{
    if ((text == null) || (text.length() == 0))
    {
        return text; // nothing to do
    }

    final StringBuilder str = new StringBuilder();

    for (char c : text.toCharArray())
    {
        switch (c)
        {
            // https://stackoverflow.com/a/4731164/2074605
            case ' ':  // '\u0020' SPACE
            case '\t': // '\u0009' CHARACTER TABULATION
            case '\n':
            case '\r':
            case '\f': // '\u000c'
            case '\u00a0': // NO-BREAK SPACE
            case '\u2002': // EN SPACE
            case '\u2003': // EM SPACE
            case '\u2009': // THIN SPACE
            case '\u200a': // HAIR SPACE
            case '\u000b': // vertical tab
            {
                break;
            }
            default:
            {
                str.append(c);
                break;
            }
        }
    }

    return str.toString();
}

This approach also lends itself to easily building other core string utilities (trim, normalize, etc.).

For example:

/**
 * Normalizes text by collapsing each run of whitespace to a single character.
 * The first whitespace character of a run is preserved; the following ones are dropped until a non-whitespace character is encountered.
 *
 * @param text The text to normalize.
 * @return The normalized text.
 */
public static String normalize(final String text)
{
    if (text == null)
    {
        return null;
    }

    final StringBuilder strbuf = new StringBuilder();

    boolean previousSpace = false;
    for (char c : text.toCharArray())
    {
        switch (c)
        {
            // https://stackoverflow.com/a/4731164/2074605
            case ' ':  // '\u0020' SPACE
            case '\t': // '\u0009' CHARACTER TABULATION
            case '\n':
            case '\r':
            case '\f': // '\u000c'
            case '\u00a0': // NO-BREAK SPACE
            case '\u2002': // EN SPACE
            case '\u2003': // EM SPACE
            case '\u2009': // THIN SPACE
            case '\u200a': // HAIR SPACE
            case '\u000b': // vertical tab
            {
                if (!previousSpace)
                {
                    strbuf.append(c);
                }
                previousSpace = true;
                break;
            }
            default:
            {
                strbuf.append(c);
                previousSpace = false;
                break;
            }
        }
    }

    return strbuf.toString();
}

And:

/**
 * Trims leading and trailing whitespace.
 * This method understands more forms of white space than String.trim().
 *
 * @param text The text to trim.
 * @return The trimmed text.
 */
public static String trim(final String text)
{
    if ((text == null) || (text.length() == 0))
    {
        return text; // nothing to do
    }

    // Find the first and last non-space characters in the text.
    Integer firstNonSpaceIdx = null;
    Integer lastNonSpaceIdx = null;

    int currentIdx = 0;

    for (char c : text.toCharArray())
    {
        switch (c)
        {
            // https://stackoverflow.com/a/4731164/2074605
            case ' ':  // '\u0020' SPACE
            case '\t': // '\u0009' CHARACTER TABULATION
            case '\n':
            case '\r':
            case '\f': // '\u000c'
            case '\u00a0': // NO-BREAK SPACE
            case '\u2002': // EN SPACE
            case '\u2003': // EM SPACE
            case '\u2009': // THIN SPACE
            case '\u200a': // HAIR SPACE
            case '\u000b': // vertical tab
            {
                break;
            }
            default:
            {
                if (firstNonSpaceIdx == null)
                {
                    firstNonSpaceIdx = currentIdx;
                }

                lastNonSpaceIdx = currentIdx;
                break;
            }
        }

        ++currentIdx;
    }

    if (firstNonSpaceIdx == null)
    {
        return ""; // the text consists entirely of whitespace
    }

    return text.substring(firstNonSpaceIdx, lastNonSpaceIdx + 1);
}

And:

/**
 * Normalizes text. This replaces multiple white spaces with a single space character.
 * It also trims any whitespace from the beginning and end of the string.
 *
 * @param text The text to normalize.
 * @return The normalized text.
 */
public static String whitespaceToSingleSpace(final String text)
{
    if (text == null)
    {
        return null;
    }

    final StringBuilder strbuf = new StringBuilder();

    boolean previousSpace = false;
    for (char c : text.toCharArray())
    {
        switch (c)
        {
            // https://stackoverflow.com/a/4731164/2074605
            case ' ':  // '\u0020' SPACE
            case '\t': // '\u0009' CHARACTER TABULATION
            case '\n':
            case '\r':
            case '\f': // '\u000c'
            case '\u00a0': // NO-BREAK SPACE
            case '\u2002': // EN SPACE
            case '\u2003': // EM SPACE
            case '\u2009': // THIN SPACE
            case '\u200a': // HAIR SPACE
            case '\u000b': // vertical tab
            {
                if (!previousSpace)
                {
                    strbuf.append(' ');
                }
                previousSpace = true;
                break;
            }
            default:
            {
                strbuf.append(c);
                previousSpace = false;
                break;
            }
        }
    }

    return trim(strbuf.toString());
}
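
The methods above all repeat the same `switch` list. One way to keep that list in a single place is a small predicate shared by the utilities. This is a sketch, not the answer's code: the class name `Whitespace` and the helper `isSpace` are hypothetical, and only two of the utilities are shown.

```java
class Whitespace {

    // Single place that defines what counts as whitespace (same list as above).
    static boolean isSpace(char c) {
        switch (c) {
            case ' ':
            case '\t':
            case '\n':
            case '\r':
            case '\f':
            case '\u000b': // vertical tab
            case '\u00a0': // NO-BREAK SPACE
            case '\u2002': // EN SPACE
            case '\u2003': // EM SPACE
            case '\u2009': // THIN SPACE
            case '\u200a': // HAIR SPACE
                return true;
            default:
                return false;
        }
    }

    // Removes every whitespace character.
    static String strip(String text) {
        if (text == null) return null;
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            if (!isSpace(c)) sb.append(c);
        }
        return sb.toString();
    }

    // Collapses each whitespace run to one space and drops leading/trailing runs.
    static String whitespaceToSingleSpace(String text) {
        if (text == null) return null;
        StringBuilder sb = new StringBuilder(text.length());
        boolean pendingSpace = false;
        for (char c : text.toCharArray()) {
            if (isSpace(c)) {
                pendingSpace = sb.length() > 0; // ignore leading whitespace
            } else {
                if (pendingSpace) sb.append(' ');
                sb.append(c);
                pendingSpace = false;
            }
        }
        return sb.toString(); // a trailing run is never appended
    }

    public static void main(String[] args) {
        String data = "Hello\n I am\t \n \n Marco\t\n";
        System.out.println(strip(data));                   // HelloIamMarco
        System.out.println(whitespaceToSingleSpace(data)); // Hello I am Marco
    }
}
```

Deferring the space until the next non-whitespace character means no separate trim pass is needed in this variant.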

score 0 · Answer 5 · answered Nov 05 '20 at 16:16

I have this in my toolbox class:

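    /**
     * This method formats a String. <br>
     * <br>
     * It places the first non-whitespace character at the left and removes all extra spaces. <br>
     * So "&nbsp;a&nbsp;bc&nbsp;&nbsp;&nbsp;cd" will be returned as "a&nbsp;bc&nbsp;cd"
     *
     * @param theValue the String to format; null yields an empty String
     * @param format   if MULTI_LINE, line breaks (\n and \r) are kept
     * @return the formatted String
     */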
    public static String stringLeftJustify( String theValue, JustifyFormat format )
    {
        // A null input yields an empty String.
        if (theValue == null)
        {
            return "";
        }

        char[] charArray = theValue.toCharArray();

        StringBuilder out = new StringBuilder( charArray.length + 1 );

        // remove any leading whitespace
        boolean isSpace = true;

        for (int c = 0; c < charArray.length; c++)
        {
            if (format == JustifyFormat.MULTI_LINE)
            {
                // leave CRLF for multi-line inputs
                if (!(charArray[c] == '\n' || charArray[c] == '\r') && Character.isWhitespace( charArray[c] ))
                {
                    if (!isSpace)
                        out.append( ' ' );

                    isSpace = true;
                }
                else
                {
                    out.append( charArray[c] );
                    isSpace = false;
                }
            }
            else
            {
                if (Character.isWhitespace( charArray[c] ))
                {
                    if (!isSpace)
                        out.append( ' ' );

                    isSpace = true;
                }
                else
                {
                    out.append( charArray[c] );
                    isSpace = false;
                }
            }
        }

        // remove trailing space
        if (isSpace && out.length() > 0)
        {
            String justified = out.toString();

            return justified.substring( 0, justified.length() - 1 );
        }

        return out.toString();
    }
