I have a class with a main method:

import java.io.IOException;
import java.util.List;

public class Main {

    // args[0] is the path to the file with the first and last words
    // args[1] is the path to the dictionary file
    public static void main(String[] args) {
        try {
            List<String> firstLastWords = FileParser.getWords(args[0]);
            System.out.println(firstLastWords);
            // Length of the first word read from the file
            System.out.println(firstLastWords.get(0).length());
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}

and I have a FileParser class:

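import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FileParser {

    final static Charset ENCODING = StandardCharsets.UTF_8;

    public FileParser() {
    }

    public static List<String> getWords(String filePath) throws IOException {
        List<String> list = new ArrayList<String>();
        Path path = Paths.get(filePath);

        // try-with-resources closes the reader, so no explicit close() is needed
        try (BufferedReader reader = Files.newBufferedReader(path, ENCODING)) {
            String line = null;
            while ((line = reader.readLine()) != null) {
                // Remove all whitespace, then skip lines that end up empty
                String line1 = line.replaceAll("\\s+","");
                if (!line1.equals("")) {
                    list.add(line1);
                }
            }
        }
        return list;
    }
}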

args[0] is the path to a text file with just 2 words. So if the file contains:

тор
кит

the program prints:

[тор, кит]
4

If the file contains:

т
тор
кит

the program prints:

[т, тор, кит]
2


Even if the file starts with an empty line:

(empty line)
тор
кит

the program prints:

[, тор, кит]
1

where the number is the length of the first string in the list.

So the question is: why does it count one extra character?

HasaDev
  • 1
    From the documentation of String#length - http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#length() "Returns the length of this string. The length is equal to the number of Unicode code units in the string." – Garis M Suero Apr 28 '15 at 00:42
  • 1
    And how does *that* explain the OP's problem? – Erwin Bolwidt Apr 28 '15 at 00:44
  • 1
    I don't understand the downvotes on this question - this guy properly includes all relevant code, describes what happens, and describes what he expected. It's almost the posterchild of a proper code question. I see tens of horrible code questions everyday that don't get any downvotes. Please explain yourself, downvoters. – Erwin Bolwidt Apr 28 '15 at 00:49
  • 1
    There could be some sort of unprintable character inside your file. Could you try going through every character in the string and printing it out individually? – kirbyquerby Apr 28 '15 at 00:53
  • 1
    @erwinbolwidt - i completely agree; i actually up voted it – Krease Apr 28 '15 at 00:53
  • try changing StandardCharset encoding – mangusta Apr 28 '15 at 00:54
  • @ErwinBolwidt maybe because the question was not formatted correctly and cannot be reproduced (so far) – Baby Apr 28 '15 at 00:55
  • Use a debugger and inspect the string contents. – bmargulies Apr 28 '15 at 00:57
  • @Baby it's not anymore. Votes should reflect the current quality of the question, it's not a punishment for the poster. And whether or not it can be reproduced has no bearing on the quality of the question. There's a on-hold reason for unreproducible questions, if it really turns out to be the case. – Erwin Bolwidt Apr 28 '15 at 00:59
  • You are right, Erwin. He tried his best with the code but is not getting the output he expected. I have seen dumb questions that never got downvotes – Mohan Raj Apr 28 '15 at 01:09
  • @ErwinBolwidt yeah, don't get me wrong, I'm neither the downvoter nor the upvoter. – Baby Apr 28 '15 at 01:10
  • This gives the expected result: `System.out.println("кит".length());`; so it's not the Cyrillic, although it does not work correctly using regex and certain character classes (`\p{Graph}`). I suspect certain control characters that are not recognized as whitespace (`\s`), or an incorrect sequence of `\r\n`. – YoYo Apr 28 '15 at 01:16
  • 1
    Does your file have some sort of BOM in it? http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8 – BillRobertson42 Apr 28 '15 at 01:29
  • The problem is only with the first word; `firstLastWords.get(1).length()` gives the right result. – HasaDev Apr 28 '15 at 01:30
  • 1
    What does `System.out.println(((int)firstLastWords.get(0).charAt(0)));` show? – BillRobertson42 Apr 28 '15 at 01:33
  • @Bill it gives 65279 – HasaDev Apr 28 '15 at 01:37
  • @Bill I have just changed the line to `String line1 = line.replaceAll("\uFEFF","");` and it worked correctly. Thank you – HasaDev Apr 28 '15 at 01:40
  • Whatever that character is, \\s+ in your regex won't filter that out. You'll need to figure out how that's getting in your file. Good luck! – BillRobertson42 Apr 28 '15 at 01:41
  • 1
    That looks like a UTF-16 BOM. – BillRobertson42 Apr 28 '15 at 01:42

2 Answers

2

Thanks all.

As @Bill said, this symbol is a BOM (http://en.wikipedia.org/wiki/Byte_order_mark), which resides at the beginning of a text file. I found it with this line:

System.out.println(((int)firstLastWords.get(0).charAt(0)));

It printed 65279, which is U+FEFF.

Then I changed this line:

String line1 = line.replaceAll("\\s+","");

to this:

String line1 = line.replaceAll("\uFEFF","");
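One caveat: replacing only "\uFEFF" means whitespace is no longer stripped, which the original "\\s+" pattern did. A minimal sketch that keeps both behaviours, assuming the BOM can only appear as the first character of the first line. The class and method names (BomAwareCleaner, stripBom, cleanWords) are hypothetical and not part of the original code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BomAwareCleaner {

    // U+FEFF is the byte order mark; its decimal value is 65279
    public static final char BOM = '\uFEFF';

    // Drops a single leading BOM, if present
    public static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    // Takes raw lines (as returned by readLine), removes the BOM from the
    // first line and all whitespace from every line, and skips empty results
    public static List<String> cleanWords(List<String> rawLines) {
        List<String> words = new ArrayList<String>();
        boolean first = true;
        for (String line : rawLines) {
            if (first) {
                line = stripBom(line);
                first = false;
            }
            String word = line.replaceAll("\\s+", "");
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Simulated file: BOM + first word, a blank line, second word
        // (the Cyrillic words from the question, written as escapes)
        List<String> raw = Arrays.asList(
                "\uFEFF\u0442\u043E\u0440", "  ", "\u043A\u0438\u0442");
        List<String> words = cleanWords(raw);
        System.out.println(words.size());          // prints 2
        System.out.println(words.get(0).length()); // prints 3
    }
}
```

In getWords the same idea would amount to calling stripBom on the first line before the whitespace replacement.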
HasaDev
1

Cyrillic characters are difficult to capture using regex; for example, \p{Graph} does not match them, although they are clearly visible characters. Anyway, that is beside the OP's question.

The actual problem is likely due to other non-visible characters, probably control characters, being present. Try the following regex to remove more of them: replaceAll("(\\s|\\p{Cntrl})+",""). You can play around with the regex to extend it to other cases.
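Which character classes actually match can be checked directly. The sketch below assumes Java's default regex semantics, where POSIX classes such as \p{Graph} and \p{Cntrl} cover US-ASCII only unless Pattern.UNICODE_CHARACTER_CLASS is set. It suggests that U+FEFF, the character reported in the comments, is caught by the Unicode category \p{Cf} (format characters) rather than by \s or \p{Cntrl}. The class name RegexClassCheck and its helper are hypothetical:

```java
import java.util.regex.Pattern;

public class RegexClassCheck {

    // True if the whole input matches the regex under the given flags
    public static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF";
        String word = "\u043A\u0438\u0442"; // second word from the question, as escapes

        System.out.println(matches("\\s", 0, bom));         // false
        System.out.println(matches("\\p{Cntrl}", 0, bom));  // false
        System.out.println(matches("\\p{Cf}", 0, bom));     // true

        // \p{Graph} is ASCII-only by default, Unicode-aware with the flag
        System.out.println(matches("\\p{Graph}+", 0, word)); // false
        System.out.println(matches("\\p{Graph}+",
                Pattern.UNICODE_CHARACTER_CLASS, word));     // true
    }
}
```

Under these assumptions, a pattern such as "(\\s|\\p{Cntrl}|\\p{Cf})+" would also remove the BOM, at the cost of removing other format characters too.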

YoYo
  • Can you try this: `replaceAll("(\\s|\\p{Cntrl}|\\n|\\r)+","")` - and let me know the outcome. – YoYo Apr 28 '15 at 01:41
  • try also putting this code `for (byte b:line1.getBytes()) {System.out.print(((long)b)&0xFF);System.out.print("/");} System.out.println();` to work out what that hidden character is. – YoYo Apr 28 '15 at 01:50
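As a variation on the byte dump suggested in the last comment, printing code points avoids depending on the platform default charset that getBytes() uses. A minimal sketch; the class name CodePointDump and the method dump are hypothetical:

```java
public class CodePointDump {

    // Formats every code point of s as U+XXXX, separated by spaces
    public static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // advance past surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A BOM followed by the first word from the question, as escapes
        System.out.println(dump("\uFEFF\u0442\u043E\u0440"));
        // prints: U+FEFF U+0442 U+043E U+0440
    }
}
```

Calling dump(firstLastWords.get(0)) in the original program would make a hidden leading character visible at a glance.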