Strange behaviour of String.length()

Question

I have a class with a main method:

import java.io.IOException;
import java.util.List;

public class Main {

    // args[0] is the path to the file with the first and last words
    // args[1] is the path to the dictionary file
    public static void main(String[] args) {
        try {
            List<String> firstLastWords = FileParser.getWords(args[0]);
            System.out.println(firstLastWords);
            System.out.println(firstLastWords.get(0).length());
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}

and I have a FileParser class:

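import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FileParser {

    public FileParser() {
    }

    final static Charset ENCODING = StandardCharsets.UTF_8;

    public static List<String> getWords(String filePath) throws IOException {
        List<String> list = new ArrayList<String>();
        Path path = Paths.get(filePath);

        // try-with-resources closes the reader, so no explicit close() is needed
        try (BufferedReader reader = Files.newBufferedReader(path, ENCODING)) {
            String line = null;
            while ((line = reader.readLine()) != null) {
                // remove all whitespace, then skip lines that end up empty
                String line1 = line.replaceAll("\\s+", "");
                if (!line1.isEmpty()) {
                    list.add(line1);
                }
            }
        }
        return list;
    }
}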

args[0] is the path to a text file containing just two words. If the file contains:

тор
кит

the program prints:

[тор, кит]
4

If the file contains:

т
тор
кит

the program prints:

[т, тор, кит]
2

Even if the file starts with an empty line:
//jump to next line
тор
кит

the program prints:

[, тор, кит]
1

where the number is the length of the first string in the list.

So the question is: why does it count one extra character?

From the documentation of String#length - http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#length() "Returns the length of this string. The length is equal to the number of Unicode code units in the string." — Garis M Suero, Apr 28 '15 at 00:42
I don't understand the downvotes on this question - this guy properly includes all relevant code, describes what happens, and describes what he expected. It's almost the poster child of a proper code question. I see tens of horrible code questions every day that don't get any downvotes. Please explain yourself, downvoters. – Erwin Bolwidt, Apr 28 '15 at 00:49
There could be some sort of unprintable character inside your file. Could you try going through every character in the string and printing it out individually? — kirbyquerby, Apr 28 '15 at 00:53
@ErwinBolwidt maybe because the question was not formatted correctly, and cannot be reproduced (so far). – Baby, Apr 28 '15 at 00:55
@Baby it's not anymore. Votes should reflect the current quality of the question, it's not a punishment for the poster. And whether or not it can be reproduced has no bearing on the quality of the question. There's an on-hold reason for unreproducible questions, if it really turns out to be the case. – Erwin Bolwidt, Apr 28 '15 at 00:59
You are right, Erwin. He tried his best with the code but is not getting the output he expected. I have seen dumb questions that never got downvotes. – Mohan Raj, Apr 28 '15 at 01:09
@ErwinBolwidt yeah, don't get me wrong, I'm not the downvoter, nor the upvoter. – Baby, Apr 28 '15 at 01:10
This gives the expected result: `System.out.println("кит".length());`; so it's not the Cyrillic, although it does not work correctly using regex and certain character classes (`\p{Graph}`). I suspect certain control characters that are not recognized as whitespace (`\s`), or an incorrect sequence of `\r\n`. — YoYo, Apr 28 '15 at 01:16
Does your file have some sort of BOM in it? http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8 — BillRobertson42, Apr 28 '15 at 01:29
The problem is only with the first word; if I try firstLastWords.get(1).length(), it gives the right result. – HasaDev, Apr 28 '15 at 01:30
What does `System.out.println(((int)firstLastWords.get(0).charAt(0)));` show? — BillRobertson42, Apr 28 '15 at 01:33
@Bill I have just changed the line to `String line1 = line.replaceAll("\uFEFF","");` and it worked. Thank you. – HasaDev, Apr 28 '15 at 01:40
Whatever that character is, \\s+ in your regex won't filter that out. You'll need to figure out how that's getting in your file. Good luck! — BillRobertson42, Apr 28 '15 at 01:41

score 2 · Answer 1 · answered Apr 28 '15 at 09:21

Thanks all.

As @Bill said, this character is a BOM (byte order mark, http://en.wikipedia.org/wiki/Byte_order_mark), which sits at the beginning of the text file. I found it with this line:

System.out.println(((int)firstLastWords.get(0).charAt(0)));

It printed 65279, which is 0xFEFF.

Then I changed this line:

String line1 = line.replaceAll("\\s+","");

to this:

String line1 = line.replaceAll("\uFEFF","");
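Note that swapping `\\s+` for `\uFEFF` also drops the whitespace stripping that the original line did. A minimal sketch that keeps both behaviours, assuming a hypothetical helper class `BomCleaner` (the name and method are illustrative, not part of the original code):

```java
public class BomCleaner {

    // U+FEFF is the byte order mark (decimal 65279).
    private static final char BOM = '\uFEFF';

    /** Drops one leading BOM if present, then removes all whitespace as the original code did. */
    public static String clean(String line) {
        if (!line.isEmpty() && line.charAt(0) == BOM) {
            line = line.substring(1);
        }
        return line.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        // BOM, three Cyrillic letters, and a trailing space
        String raw = "\uFEFF\u0442\u043e\u0440 ";
        System.out.println(raw.length());        // 5
        System.out.println(clean(raw).length()); // 3
    }
}
```

In `getWords`, `BomCleaner.clean(line)` could stand in for the `replaceAll` call. Only the first line of the file can start with the BOM, so checking every line is harmless but slightly more than strictly needed.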

score 1 · Answer 2 · answered Apr 28 '15 at 01:24

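Cyrillic characters can be awkward to match with regex: for example, \p{Graph} does not match them by default, even though they are clearly visible characters (Java's POSIX classes are ASCII-only unless Pattern.UNICODE_CHARACTER_CLASS is set). Anyway, that is beside the OP's question.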

The actual problem is likely caused by other non-visible characters in the file, probably control characters. Try the following regex to remove more of them: replaceAll("(\\s|\\p{Cntrl})+",""). You can play around with the regex to extend it to other cases.
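To see which character classes would actually catch the character found in this thread (U+FEFF), here is a small check, assuming a hypothetical class `BomRegexCheck`. It suggests that neither `\s` nor `\p{Cntrl}` matches the BOM, because U+FEFF is a format character (Unicode category Cf), while `\p{Cf}` does match it:

```java
import java.util.regex.Pattern;

public class BomRegexCheck {

    /** True if the whole string s is matched by the regex with default flags. */
    public static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    /** Same check, but with Unicode-aware predefined and POSIX character classes. */
    public static boolean matchesUnicode(String regex, String s) {
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF";
        String te = "\u0442"; // Cyrillic small letter te

        System.out.println(matches("\\s", bom));              // false
        System.out.println(matches("\\p{Cntrl}", bom));       // false
        System.out.println(matches("\\p{Cf}", bom));          // true
        System.out.println(matches("\\p{Graph}", te));        // false
        System.out.println(matchesUnicode("\\p{Graph}", te)); // true
    }
}
```

Under that reading, a pattern such as `(\\s|\\p{Cf})+` would be one way to strip the BOM along with whitespace, though it would also remove other format characters.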

answered Apr 28 '15 at 01:24

YoYo


Can you try this: `replaceAll("(\\s|\\p{Cntrl}|\\n|\\r)+","")` - and let me know the outcome. – YoYo Apr 28 '15 at 01:41
Also try adding this code, `for (byte b:line1.getBytes()) {System.out.print(((long)b)&0xFF);System.out.print("/");} System.out.println();`, to work out what that hidden character is. – YoYo Apr 28 '15 at 01:50
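A variation on the byte dump suggested in the comment above, sketched with a hypothetical `CharDump` class: `getBytes()` without an argument uses the default charset, so its output can differ between machines, whereas printing each `char` as a `U+XXXX` value shows the hidden character directly.

```java
public class CharDump {

    /** Returns the UTF-16 code units of s as space-separated U+XXXX values. */
    public static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A BOM followed by one Cyrillic letter
        System.out.println(dump("\uFEFF\u0442")); // U+FEFF U+0442
    }
}
```

Calling `CharDump.dump(firstLastWords.get(0))` in `Main` would make the leading U+FEFF visible at a glance.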
