45

The JDK's String.trim() method is pretty naive: it only removes characters at or below U+0020, i.e. the ASCII space and the ASCII control characters.

Apache Commons' StringUtils.strip() is slightly better, but uses the JDK's Character.isWhitespace(), which doesn't recognize non-breaking space as whitespace.
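To make the gap concrete, here is a minimal sketch that uses only JDK calls. The class name NbspDemo and the sample string are made up for illustration.

```java
final class NbspDemo {
    // Leading and trailing U+00A0 (NO-BREAK SPACE), written as escapes so they are visible.
    static final String SAMPLE = "\u00A0text\u00A0";

    public static void main(String[] args) {
        char nbsp = '\u00A0';
        // trim() only drops characters <= U+0020, so the length is unchanged.
        System.out.println(SAMPLE.trim().length() == SAMPLE.length());
        // isWhitespace() explicitly excludes non-breaking spaces.
        System.out.println(Character.isWhitespace(nbsp));
        // isSpaceChar() follows the Unicode separator categories and accepts it.
        System.out.println(Character.isSpaceChar(nbsp));
    }
}
```

Running it prints true, false, true: neither trim() nor an isWhitespace()-based strip touches the non-breaking space, although isSpaceChar() does recognize it.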

So what would be the most complete, Unicode-compatible, safe and proper way to trim a string in Java?

And incidentally, is there a better library than commons-lang that I should be using for this sort of stuff?

itsadok

6 Answers

61

Google has made guava-libraries available recently. It may have what you are looking for:

CharMatcher.inRange('\0', ' ').trimFrom(str)

is equivalent to String.trim(), but you can customize what to trim; refer to the Javadoc for details.

For instance, it has its own definition of WHITESPACE which differs from the JDK and is defined according to the latest Unicode standard, so what you need can be written as:

CharMatcher.WHITESPACE.trimFrom(str)
CrazyCoder
  • 1
    Tip: [`trimAndCollapseFrom`](http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/base/CharMatcher.html#trimAndCollapseFrom(java.lang.CharSequence,%20char)) trims the outside of the string while also replacing duplicate values inside the string. – Basil Bourque Mar 07 '15 at 05:59
8

I swear I only saw this after I posted the question: Google just released Guava, a library of core Java utilities.

I haven't tried this yet, but from what I can tell, this is fully Unicode compliant:

String s = "  \t testing \u00a0";
s = CharMatcher.WHITESPACE.trimFrom(s);
itsadok
  • 2
    Haha, I've provided the same answer just 5 minutes earlier, but then edited it to include the exact code you need to use, and just then saw your comment that you found it yourself. – CrazyCoder Sep 17 '09 at 11:20
3

It's really hard to define what constitutes whitespace. Sometimes I use non-breaking spaces precisely to make sure they don't get stripped. So it will be hard to find a library that does exactly what you want.

I use my own trim() if I want to trim every whitespace character. Here is the function I use to check for whitespace:

  public static boolean isWhitespace (int ch)
  {
    if (ch == ' ' || (ch >= 0x9 && ch <= 0xD))
      return true;
    if (ch < 0x85) // short-circuit optimization.
      return false;
    if (ch == 0x85 || ch == 0xA0 || ch == 0x1680 || ch == 0x180E)
      return true;
    if (ch < 0x2000 || ch > 0x3000)
      return false;
    return ch <= 0x200A || ch == 0x2028 || ch == 0x2029
      || ch == 0x202F || ch == 0x205F || ch == 0x3000;
  }
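
A trim() built on top of such a check could look like the following minimal sketch. It is not the author's actual trim(), which is not shown here. The class name PredicateTrim and the stand-in predicate in main are assumptions for illustration; any int-based check, such as the function above, can be passed in instead.

```java
import java.util.function.IntPredicate;

final class PredicateTrim {
    /** Removes code points accepted by the predicate from both ends of s. */
    static String trim(String s, IntPredicate isWhitespace) {
        int start = 0;
        int end = s.length();
        // Advance past leading matches, one code point at a time.
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isWhitespace.test(cp)) break;
            start += Character.charCount(cp);
        }
        // Back up over trailing matches, never crossing start.
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isWhitespace.test(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // Stand-in predicate (an assumption): JDK blanks, but keep U+00A0 on purpose.
        IntPredicate ws = cp -> cp != 0x00A0
                && (Character.isWhitespace(cp) || Character.isSpaceChar(cp));
        System.out.println("[" + trim("\u2003 keep\u00A0 ", ws) + "]");
    }
}
```

Passing the predicate as a parameter keeps the decision about what counts as whitespace, including whether a non-breaking space should survive, in the caller's hands.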
ZZ Coder
  • 6
    ZZ Coder -- you say, "it will be hard to find a library to do exactly what you want." Not true! Say you want to match all whitespace _except_ a \u00a0 (non-breaking space). Easy: CharMatcher.WHITESPACE.and(CharMatcher.isNot('\u00a0')).trimFrom(input) – Kevin Bourrillion Nov 04 '09 at 01:59
  • 2
    @KevinBourrillion just wanted to send over a big "thanks" for `CharMatcher.WHITESPACE`. `String#trim()` fails so hard with Unicode. – Matt Ball Mar 21 '13 at 22:38
2

I've always found trim to work pretty well for almost every scenario.

However, if you really want to include more characters, you can edit the strip method from commons-lang so that it tests not only Character.isWhitespace but also Character.isSpaceChar, which seems to be what's missing. Namely, change the following lines in stripStart and stripEnd, respectively:

  • while ((start != strLen) && Character.isWhitespace(str.charAt(start)))
  • while ((end != 0) && Character.isWhitespace(str.charAt(end - 1)))
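
As a rough sketch of that idea, and not the actual commons-lang source, a standalone version that applies both tests could look like this. The names StripSketch, isBlank and strip are hypothetical.

```java
final class StripSketch {
    // True for anything either JDK test treats as blank, so U+00A0 is included.
    static boolean isBlank(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    // Mirrors the shape of the two loops quoted above, with the widened test.
    static String strip(String str) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        int start = 0;
        int end = str.length();
        while (start != end && isBlank(str.charAt(start))) {
            start++;
        }
        while (end != start && isBlank(str.charAt(end - 1))) {
            end--;
        }
        return str.substring(start, end);
    }
}
```

With this change, StripSketch.strip("\u00A0 text \u00A0") returns "text", whereas a test based on Character.isWhitespace alone would stop at the non-breaking spaces.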
João Silva
0

I made a few small changes to Java's trim() method so that it also trims the non-breaking space (U+00A0). This method runs faster than most other implementations.

public static String trimAdvanced(String value) {

        Objects.requireNonNull(value); // needs: import java.util.Objects;

        int st = 0;
        int len = value.length();

        // Skip leading characters that trim() removes (<= ' '), plus U+00A0.
        // The extra parentheses around the || matter: && binds tighter than ||.
        while ((st < len) && ((value.charAt(st) <= ' ') || (value.charAt(st) == '\u00A0'))) {
            st++;
        }
        // Skip the same characters at the end, never moving past st.
        while ((st < len) && ((value.charAt(len - 1) <= ' ') || (value.charAt(len - 1) == '\u00A0'))) {
            len--;
        }

        // Return the original instance when nothing was trimmed.
        return ((st > 0) || (len < value.length())) ? value.substring(st, len) : value;
}
Ertuğrul Çetin
-1

This handles Unicode characters and doesn't require extra libraries:

String trimmed = original.replaceAll ("^\\p{IsWhite_Space}+|\\p{IsWhite_Space}+$", "");

A slight snag is that there are some related whitespace characters that lack the Unicode character property "WSpace=Y"; they are listed on Wikipedia. These probably won't cause a problem, but you can easily add them to the character class too.

Using almson-regex the regex will look like:

String trimmed = original.replaceAll (either (START_BOUNDARY + oneOrMore (WHITESPACE), oneOrMore (WHITESPACE) + END_BOUNDARY), "");

and it will also include the more relevant of the whitespace characters that lack the Unicode property.
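
If the plain-JDK regex is used often, it can be compiled once and reused. The following is a minimal sketch under that assumption; the class name RegexTrim and the choice of extra characters (U+200B, U+2060, U+FEFF) are illustrative, not a definitive list.

```java
import java.util.regex.Pattern;

final class RegexTrim {
    // Same expression as above, compiled once instead of on every call.
    private static final Pattern EDGES =
            Pattern.compile("^\\p{IsWhite_Space}+|\\p{IsWhite_Space}+$");

    // Variant that also accepts a few characters without the White_Space property.
    private static final Pattern EDGES_EXTENDED = Pattern.compile(
            "^[\\p{IsWhite_Space}\\u200B\\u2060\\uFEFF]+"
            + "|[\\p{IsWhite_Space}\\u200B\\u2060\\uFEFF]+$");

    static String trim(String original) {
        return EDGES.matcher(original).replaceAll("");
    }

    static String trimExtended(String original) {
        return EDGES_EXTENDED.matcher(original).replaceAll("");
    }
}
```

Here RegexTrim.trim("\u00A0\u2003 testing \t\u3000") returns "testing", and trimExtended additionally removes a leading byte order mark or a trailing zero-width space.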

Aleksandr Dubinsky