Comparing 2 text files of different locales in Java

Question

I am trying to compare two files: one is plain (non-English) text, and the other is a glossary of key-value pairs. They look similar to this:

Japanese Text file:

わたしのなまえはしんです。
ソフトウェアインギネアとしてはたらいています.

En-Jp properties file:

as:と
software:ソフトウェア
me:わたしを
name:なまえ
I:わたしは
working:はたらいています。
...

I am trying to compare the contents of these two files with the code below:

    Scanner kb = new Scanner(System.in);
    String localtext;
    String glossarytext;
    File dictionary = new File("./src/main/resources/ZN_EN_Test.txt");
    Scanner dictScanner = new Scanner(dictionary);
    File list = new File("./src/main/resources/ZN_JP_Test.txt");
    try {
        while (dictScanner.hasNextLine()) {
            glossarytext = dictScanner.nextLine();

            try (Scanner listScanner = new Scanner(list)) {
                while (listScanner.hasNextLine()) {
                    localtext = listScanner.nextLine();

                    if (glossarytext.contains(localtext))
                        System.out.println(localtext);
                }
            }
        }
    } catch (NoSuchElementException e) {
        e.printStackTrace();
    }

The problem is that, since the Japanese text has no spaces between words, the scanner returns each line as a whole and the contains condition never passes. The same program runs successfully if I arrange the words like this:

わたしの
なまえ
は
しん
です。

How can I make it find the matching content without reformatting the Japanese text file?

score 1 · Accepted Answer · edited May 23 '17 at 11:59

Let me try to reformulate the question: you have a plain text with no delimiters and a dictionary (possibly with more words than appear in the text), and you want to know whether the plain text is a concatenation of dictionary words (true or false).

Scanner is intended to work with delimiters, and you don't have any.

Better to use a Matcher.

1. First, construct a regex from all of your dictionary words: (word1|word2|word3|...)*

2. Then match it against the text.
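The two steps above can be sketched roughly as follows. This is only an illustration, not the exact code from the answer: the class name `GlossaryRegex` and the helper `concatenationOf` are made up for the example, and the word list is the hand-split sentence from the question. `Pattern.quote` is used so that any regex metacharacters in a glossary entry are treated literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class GlossaryRegex {
    // Hypothetical helper: builds (?:word1|word2|...)* from the dictionary words.
    static Pattern concatenationOf(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)              // escape each word literally
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?:" + alternation + ")*");
    }

    public static void main(String[] args) {
        // Word list taken from the hand-split example in the question.
        List<String> words = Arrays.asList("わたしの", "なまえ", "は", "しん", "です。");
        Pattern p = concatenationOf(words);

        // matches() succeeds only if the WHOLE text is a concatenation of the words.
        System.out.println(p.matcher("わたしのなまえはしんです。").matches());   // true
        System.out.println(p.matcher("わたしのなまえはたなかです。").matches()); // false
    }
}
```

If you only need to know whether some glossary word occurs anywhere in the text, drop the surrounding `(?:...)*` and call `find()` instead of `matches()`.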

If you have too many words in the dictionary, see this: Java : does regex pattern matcher have a size limit?

There is also a link there to the Aho–Corasick algorithm.

Remark 1: if you want to get the decomposition, see this: Create array of regex matches
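For Remark 1, a minimal sketch of collecting the matched words with a `find()` loop might look like this. The class `GlossaryMatches` and the method `findWords` are hypothetical names; the text and glossary values are the sample data from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class GlossaryMatches {
    // Hypothetical helper: returns every dictionary word found in the text, in order of appearance.
    static List<String> findWords(String text, List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Matcher m = Pattern.compile(alternation).matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group()); // the dictionary word matched at this position
        }
        return found;
    }

    public static void main(String[] args) {
        // Glossary values from the question's properties file.
        List<String> values = Arrays.asList(
                "と", "ソフトウェア", "わたしを", "なまえ", "わたしは", "はたらいています。");
        String text = "ソフトウェアインギネアとしてはたらいています.";
        System.out.println(findWords(text, values)); // [ソフトウェア, と]
    }
}
```

Note that "はたらいています。" is not found here, because the sample text ends with an ASCII "." rather than the full-width "。" used in the glossary.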

Remark 2: the result can be ambiguous, depending on your words (for example, if you have AA, BB, and AABB in your dictionary; I don't know Japanese).
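One possible way to make the result deterministic, offered here as an assumption rather than part of the original answer, is to put longer words first in the alternation. Java's regex engine tries alternatives from left to right and takes the first one that succeeds, so with AA|BB|AABB the text AABB is reported as AA then BB, while longest-first ordering reports AABB. The class `LongestFirst` and its method are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LongestFirst {
    // Hypothetical helper: orders the alternation so that longer words are tried first.
    static Pattern longestFirst(List<String> words) {
        String alternation = words.stream()
                .sorted((a, b) -> Integer.compare(b.length(), a.length())) // descending length
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    public static void main(String[] args) {
        Matcher m = longestFirst(Arrays.asList("AA", "BB", "AABB")).matcher("AABB");
        while (m.find()) {
            System.out.println(m.group()); // prints AABB once
        }
    }
}
```

This only picks one of the possible decompositions; whether the longest match is the right one depends on the glossary.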

Hope this helps

Thanks a lot for detailed explanation. `Matchers` are helpful, as I only want to see if particular pattern exists or not. I really don't want to extract matched pattern. But all above mentioned remarks and references are really helpful. — MKay, Nov 25 '15 at 05:41
