-1

I asked a question earlier but met harsh criticism, so I am posing it again here, simplified and rephrased to address the concerns about the way I asked it before.

BACKGROUND I am parsing some HTML for information. I have stripped each line down to just the content I wish to grab, followed by a bunch of spaces. To get rid of the spaces, I opted to use trim(), but I have been having trouble. The last few lines of my code are tests:

System.out.println("'" + someString + "'\n'" + someString.trim() + "'");

The results were:

'Sophomore                                          '
'Sophomore                                          '

I was worried I might have a problem with the way I was calling trim(), since we all make mistakes from time to time, so I tested it like this:

String s = "   hello         ";
System.out.println("'" + s + "'\n'" + s.trim() + "'");

The results were:

'   hello         '
'hello'

MY QUESTION What am I doing wrong? What I want is to get 'Sophomore', not 'Sophomore                                          '

I look forward to your excellent answers (thanks in advance!).

Alnitak
Olin Kirkland
2 Answers

3

String.trim() only removes leading and trailing characters whose code is \u0020 or below: everything before the first character whose code exceeds \u0020, and everything after the last such character.

This is insufficient to remove all possible white space characters - Unicode defines several more (with code points above \u0020) that will not be matched by .trim().

Perhaps your white space characters aren't the ones you think they are?
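To make the distinction concrete, here is a minimal sketch of how the standard library classifies an ordinary space versus a no-break space. The class name `TrimVersusNbsp` and the helper `classify` are made up for illustration; the `c <= '\u0020'` comparison mirrors the rule described above rather than calling `trim()` directly.

```java
class TrimVersusNbsp {
    // Hypothetical helper: reports how one character is treated by the
    // trim() rule and by the two Character classification methods.
    static String classify(char c) {
        return String.format("U+%04X trim=%b isWhitespace=%b isSpaceChar=%b",
                (int) c,
                c <= '\u0020',              // the rule trim() applies
                Character.isWhitespace(c),  // Java whitespace, excludes no-break spaces
                Character.isSpaceChar(c));  // Unicode space separators
    }

    public static void main(String[] args) {
        System.out.println(classify(' '));
        System.out.println(classify('\u00a0'));

        // trim() leaves trailing no-break spaces in place.
        String padded = "Sophomore\u00a0\u00a0";
        System.out.println(padded.trim().length()); // still 11, not 9
    }
}
```

Note that `Character.isWhitespace` also returns false for a no-break space, so only `Character.isSpaceChar` recognises it here.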

EDIT comments revealed that the extra characters were indeed "special" whitespace characters, specifically \u00a0 which is a Unicode "non-breaking space". To replace those with normal spaces, use:

str = str.replace('\u00a0', ' ');
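Putting the replacement together with `trim()`, a small sketch might look like the following. The class and method names (`NbspTrim`, `trimNbsp`, `stripUnicode`) are hypothetical. The second helper is an alternative, not something suggested in the thread: it strips only leading and trailing space-like characters and leaves any no-break spaces inside the text untouched.

```java
class NbspTrim {
    // Hypothetical helper: turn every no-break space into an ordinary
    // space, then let trim() remove the leading and trailing ones.
    static String trimNbsp(String s) {
        return s.replace('\u00a0', ' ').trim();
    }

    // Hypothetical alternative: strip only the ends, keeping interior
    // no-break spaces exactly as they were.
    static String stripUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isBlank(s.charAt(start))) {
            start++;
        }
        while (end > start && isBlank(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    private static boolean isBlank(char c) {
        return c <= '\u0020' || Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    public static void main(String[] args) {
        String raw = "Sophomore\u00a0\u00a0\u00a0";
        System.out.println("'" + trimNbsp(raw) + "'");      // 'Sophomore'
        System.out.println("'" + stripUnicode(raw) + "'");  // 'Sophomore'
    }
}
```

The first version is the simpler one and is probably enough when the text never needs to keep its no-break spaces.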
Alnitak
  • THANK YOU. THAT MIGHT BE IT. I've been thinking this for a while. What could they be?? If they aren't spaces, why do they look like them?? – Olin Kirkland Sep 09 '12 at 23:28
  • Agree. The critical thing he's not showing us is the pre-processed text such as a small test case data that shows the error. 1+ – Hovercraft Full Of Eels Sep 09 '12 at 23:28
  • @OlinKirkland try looping over the string and using `codePointAt` to find out each character's value. They might be alternate Unicode characters, for example. – Alnitak Sep 09 '12 at 23:29
  • @ Hovercraft, what do you mean by the pre-processed text? The exact copy before I cut out the beginning and end of the string? – Olin Kirkland Sep 09 '12 at 23:29
  • @Olin: a small bit of text that when processed by the [sscce](http://sscce.org) that you would normally post for a question like this, would reproduce the problem. – Hovercraft Full Of Eels Sep 09 '12 at 23:30
  • 53 6f 70 68 6f 6d 6f 72 65 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 a0 This is the output when I did what muratgu suggested. I think I'm getting a little deep for my expertise. What does this mean, exactly? Do you guys know? – Olin Kirkland Sep 09 '12 at 23:37
  • The `a0` characters are your problem. They're a Unicode "no-break-space" and as such not recognised by `.trim()`. http://www.fileformat.info/info/unicode/char/a0/index.htm – Alnitak Sep 09 '12 at 23:40
  • Well, damn. Will I just have to write my own trim() method to take care of just a0 characters? Or better yet, use replace() to replace a0 with " "? How do I apply replace to the character hexes? Stupid college sites not using actual spaces. – Olin Kirkland Sep 09 '12 at 23:41
  • @OlinKirkland, [Non-breaking space](http://en.wikipedia.org/wiki/Non-breaking_space) – Alexander Sep 09 '12 at 23:42
  • @OlinKirkland you should be able to write a regex (oh, the irony...) to replace `\u00a0` with a normal space, and then use `.trim` as before. – Alnitak Sep 09 '12 at 23:42
  • http://stackoverflow.com/questions/4455218/remove-specific-character-from-a-string-based-on-hex-value-c-sharp – muratgu Sep 09 '12 at 23:42
  • @muratgu that answer is c#, not Java – Alnitak Sep 09 '12 at 23:43
  • @OlinKirkland `str = str.replace('\u00a0', ' ');` – Alnitak Sep 09 '12 at 23:44
1

There must be a non-whitespace character in the source string. Add the following to your code and see what it prints.

for (char ch : someString.toCharArray()) {
    System.out.print(Integer.toHexString(ch) + " ");
}
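A variant of the same diagnostic, sketched with `codePointAt` as suggested in the comments on the other answer, is shown below. The class name `CodePointDump` and the helper `hexDump` are made up for illustration. Unlike the `char` loop, it reports characters outside the Basic Multilingual Plane as a single value instead of two surrogate halves.

```java
class CodePointDump {
    // Hypothetical helper: lists every code point of s in hex,
    // separated by single spaces.
    static String hexDump(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(Integer.toHexString(cp));
            i += Character.charCount(cp); // advance 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An "a0" after the visible text points to a no-break space.
        System.out.println(hexDump("So\u00a0")); // 53 6f a0
    }
}
```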
muratgu