EDIT (resolved): It was a U+00A0 non-breaking space (UTF-8 bytes C2 A0, which is the [-62, -96] in the output below), not a c0a0 non-breaking space.
After using Apache POI to convert a docx to plaintext, then reading the plaintext into Java and trying to parse it, I've run into the following problem.
Output:
" "
first characterequals SPACE OR TAB
false
[B@5e481248
[B@66d3c617
ARRAYTOSTRING SPACE: [32]
ARRAYTOSTRING ?????: [-62, -96]
For code:
System.out.println("\t\"" + line.substring(0,1) + "\"\n\tfirst characterequals SPACE OR TAB \n\t" + (line.substring(0,1).equals(" ")
|| line.substring(0,1).equals("\t") ));
System.out.println(line.substring(0,1).getBytes());
System.out.println(" ".getBytes());
System.out.println("ARRAYTOSTRING SPACE: " + Arrays.toString(" ".getBytes()));
System.out.println("ARRAYTOSTRING ?????: " + Arrays.toString(line.substring(0,1).getBytes()));
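For anyone debugging something similar: printing the code point is usually quicker than reading raw bytes, and passing an explicit charset to getBytes keeps the byte dump independent of the platform default. A minimal sketch, where the class name, helper name and sample line are made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class InspectFirstChar {
    // Formats the first code point of s as "U+XXXX" so invisible characters can be identified.
    static String describeFirstChar(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // Hypothetical input: a line that starts with a non-breaking space.
        String line = "\u00A0Some text";

        // Prints U+00A0
        System.out.println(describeFirstChar(line));

        // Prints [-62, -96], i.e. the UTF-8 bytes 0xC2 0xA0
        System.out.println(Arrays.toString(
                line.substring(0, 1).getBytes(StandardCharsets.UTF_8)));
    }
}
```

The [-62, -96] pair is just 0xC2 0xA0 shown as signed bytes, which is how UTF-8 encodes U+00A0.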
String.trim() does not get rid of it.
String.replaceAll("\\s", "") does not get rid of it either.
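Given the resolution in the edit at the top, both results make sense: trim() only removes characters up to U+0020, and \s in a Java regex matches only [ \t\n\x0B\f\r] by default, so U+00A0 slips past both. A minimal sketch of two ways to handle it explicitly (the class and method names are hypothetical, not from my real code):

```java
class NbspCleanup {
    // Turns every U+00A0 into an ordinary space, then trims as usual.
    static String normalize(String s) {
        return s.replace('\u00A0', ' ').trim();
    }

    // Removes ASCII whitespace and U+00A0 everywhere in the string.
    static String stripAll(String s) {
        return s.replaceAll("[\\s\\u00A0]+", "");
    }

    public static void main(String[] args) {
        // Hypothetical input with leading non-breaking spaces.
        String line = "\u00A0\u00A0Steel, grade A ";

        System.out.println("\"" + normalize(line) + "\""); // "Steel, grade A"
        System.out.println("\"" + stripAll(line) + "\"");  // "Steel,gradeA"
    }
}
```

normalize is probably the safer choice when parsing, since it keeps the spaces between words and only cleans up the edges.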
I'm trying to parse an enormous materials document and this is turning into a major hurdle. I have no idea what this character is or how to deal with it. Can anyone shed some light on what's going on here?