3

EDIT: Resolved. The character was a non-breaking space, U+00A0 ("\u00a0" in Java), not "\uc2a0".

I used Apache POI to convert a docx file to plain text and then read the plain text into Java. When I try to parse it, I run into the following problem.

Output:

" "
first characterequals SPACE OR TAB 
false
[B@5e481248
[B@66d3c617
ARRAYTOSTRING SPACE: [32]
ARRAYTOSTRING ?????: [-62, -96]

For code:

System.out.println("\t\"" + line.substring(0,1) + "\"\n\tfirst characterequals SPACE OR TAB \n\t" + (line.substring(0,1).equals(" ") 
                        || line.substring(0,1).equals("\t") ));
System.out.println(line.substring(0,1).getBytes());
System.out.println(" ".getBytes());
System.out.println("ARRAYTOSTRING SPACE: " + Arrays.toString(" ".getBytes()));
System.out.println("ARRAYTOSTRING ?????: " + Arrays.toString(line.substring(0,1).getBytes()));
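A note on reading that output: printing a byte[] directly gives its identity string (the [B@... lines), and getBytes() without an argument uses the JVM's default charset. A minimal sketch of a more readable dump, assuming UTF-8 is the encoding of interest, is below. The class and method names are made up for illustration, and "\u00A0" stands in for line.substring(0,1).

```java
import java.nio.charset.StandardCharsets;

final class CodePointDump {
    /** Formats each code point of s as U+XXXX, separated by spaces. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    /** Formats the UTF-8 bytes of s as two-digit hex, separated by spaces. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String first = "\u00A0"; // stand-in for line.substring(0, 1)
        System.out.println(codePoints(first)); // U+00A0
        System.out.println(utf8Hex(first));    // c2 a0
    }
}
```

Seeing the code point (U+00A0) next to the encoded bytes (c2 a0) makes it easier to tell the character apart from its UTF-8 encoding.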

String.trim() does not get rid of it.
String.replaceAll("\\s", "") does not get rid of it either.

I'm trying to parse an enormous materials document and this is turning into a major hurdle. I have no idea what this character is or how to match it. Can anyone shed some light on what's going on here?

Captain Prinny
  • You really should add at least a snippet of your extraction code. – llogiq Jun 03 '15 at 21:13
  • I'm not sure the extraction code would make much sense out of context, it's just pulling line by line and this debug snippet is duplicating the loop checks to make it visible what's actually being compared. – Captain Prinny Jun 04 '15 at 13:40

2 Answers

3

The bytes [-62, -96] are c2 a0 in hex, which according to this answer is the UTF-8 encoding of a non-breaking space (U+00A0). Note that Java does not treat this as an ordinary space: \s will not match it by default, and trim() will not remove it.
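A minimal sketch of how this plays out in Java, and two ways to deal with it. The class name, method names, and sample string are made up for illustration. The sketch assumes it is acceptable to treat every Unicode space separator (\p{Zs}) like ordinary whitespace.

```java
final class NbspDemo {
    static final String NBSP = "\u00A0";

    /** Replaces each non-breaking space with an ordinary space. */
    static String nbspToSpace(String s) {
        return s.replace('\u00A0', ' ');
    }

    /** Removes ordinary whitespace plus Unicode space separators such as U+00A0. */
    static String removeAllWhitespace(String s) {
        return s.replaceAll("[\\s\\p{Zs}]+", "");
    }

    public static void main(String[] args) {
        String line = NBSP + "Material 42"; // hypothetical line starting with U+00A0

        System.out.println(line.trim().equals(line));         // true: trim() leaves U+00A0 alone
        System.out.println(NBSP.matches("\\s"));              // false
        System.out.println(NBSP.matches("\\p{Zs}"));          // true
        System.out.println(Character.isWhitespace('\u00A0')); // false
        System.out.println(Character.isSpaceChar('\u00A0'));  // true

        System.out.println(nbspToSpace(line).trim());         // Material 42
        System.out.println(removeAllWhitespace(line));        // Material42
    }
}
```

Replacing U+00A0 with a plain space first keeps the rest of the parsing code unchanged, which may be the less invasive choice for an existing parser.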

llogiq
  • 1) Is there an easily referable source / set of characters that will appear as whitespace but not match it (or regex to include these) 2) Does this character have an escape sequence or anything simple that can be matched to it? – Captain Prinny Jun 03 '15 at 22:05
  • I used http://www.amp-what.com/unicode/search/space (though it contains a lot of other results). The escape sequence should be (somewhat unsurprisingly) `\u{c2a0}`. – llogiq Jun 03 '15 at 23:16
  • Are there any other faux whitespace I might run into, or is this the outlier – Captain Prinny Jun 03 '15 at 23:36
  • Again, look at the amp-what page; it lists a few. Notable offenders are \u200B (the zero-width space), \u2002 through \u200A (spaces of various widths), and \u202F (the narrow no-break space). – llogiq Jun 04 '15 at 08:39
  • Thanks a lot, this is super helpful. – Captain Prinny Jun 04 '15 at 13:17
  • Actually, it turns out that your evaluation is incorrect. Arrays.toString of the bytes of "\uc2a0" is [-20, -118, -96], and val.equals("\uc2a0") returns false. Searching for the character from the output returns &#160;. I'm not familiar enough with Unicode to know how that converts to hex such that you got c2a0, or how to escape it. According to unicodelookup, the hex is just a0. – Captain Prinny Jun 04 '15 at 13:33
  • #160 is a non-breaking space. The same analysis applies. See http://www.amp-what.com/unicode/search/%23160 – llogiq Jun 16 '15 at 14:00
  • The analysis applies, but it required further investigation to implement. – Captain Prinny Jun 16 '15 at 14:06
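
For the "faux whitespace" discussed in the comments above, a rough sketch of one way to catch such characters in one place. The class and method names are made up for illustration, and the set of characters treated as space-like is an assumption that may need adjusting for a given document. The zero-width space (U+200B) is listed explicitly rather than relying on the Character classification methods.

```java
final class FauxWhitespace {
    /** True for regular whitespace, Unicode space separators, and the zero-width space. */
    static boolean isSpaceLike(int cp) {
        return Character.isWhitespace(cp)  // tab, line breaks, U+0020, ...
            || Character.isSpaceChar(cp)   // U+00A0, U+2002..U+200A, U+202F, ...
            || cp == 0x200B;               // zero-width space, handled explicitly
    }

    /** Replaces each space-like code point (including tabs and line breaks) with U+0020. */
    static String toPlainSpaces(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> sb.appendCodePoint(isSpaceLike(cp) ? ' ' : cp));
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00A0, U+2003 (em space), U+202F (narrow no-break space), U+200B
        String raw = "a\u00A0b\u2003c\u202Fd\u200Be";
        System.out.println(toPlainSpaces(raw)); // a b c d e
    }
}
```

Because line breaks are also mapped to a plain space, this is meant to be applied to one line at a time. After the conversion, trim() and \s behave as expected.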
0

This worked for me:

 // valor is the input string; requires Apache Commons Lang 3 on the classpath
 String cleaned = org.apache.commons.lang3.StringUtils.normalizeSpace(java.text.Normalizer.normalize(valor, java.text.Normalizer.Form.NFD));