Compare content of two text files and split words java

Question

I know this question has already been asked several times, but I can't find a way to apply the answers to my code. My goal is the following: I have two files, griechenland_test.txt and outagain5.txt. I want to read them and then find out what percentage of outagain5.txt is contained in the other file.

outagain5.txt contains lines like this:

mit dem    542824
und die    517126

And griechenland_test.txt is a normal Wikipedia article about that topic (so ordinary running text, without frequency counts).

1. Problem - How can I split the input into bigrams? That is, every two words, each pair overlapping with the one before. So if I have the words A, B, C, D, I want to get AB, BC, CD. I have this:

    // Counts single words (not yet bigrams): hash maps each word to its frequency.
    while ((sCurrentLine = in.readLine()) != null) {
        // System.out.println(sCurrentLine);
        arr = sCurrentLine.split(" ");
        for (int i = 0; i < arr.length; i++) {
            if (null == hash.get(arr[i])) {
                hash.put(arr[i], 1);
            } else {
                int x = hash.get(arr[i]) + 1;
                hash.put(arr[i], x);
            }
        }
    }

Then I read the other file with this code (I only add the words, not the number; I split on 4 spaces, so the two words are in h[0]).

 for (String line = br.readLine(); line != null; line = br.readLine()) {
        String h[] = line.split("   ");

        words.add(h[0]);

    }

2. Problem - Now I compare each String x in hash with each String s in words. I added the else branch with System.out.println to see which words are not contained in outagain5.txt, but several of the printed words ARE contained in outagain5.txt, and I don't understand why. So I think the comparison doesn't work properly, or maybe this will be solved once the first problem is fixed.

    ArrayList<String> words = new ArrayList<String>();
    ArrayList<String> neuS = new ArrayList<String>();
    ArrayList<Long> neuZ = new ArrayList<Long>();

for (String x : hash.keySet()) {
        summe = summe + hash.get(x); 
        long neu = hash.get(x);
        for (String s : words) {

            if (x.equals(s)) {
                neuS.add(x);
                neuZ.add(neu);
                disc = disc + 1;
            } else {
                System.out.println(x);
                break;
            }

        }
    }

Hope I made my question clear, thanks a lot!!

score 0 · Answer 1 · answered Jul 15 '15 at 17:21

If I recall correctly, String has a method split(regex, limit) that splits the string around matches of the regex, where the limit caps how many pieces you get back.

I am referencing this JavaDoc https://docs.oracle.com/javase/6/docs/api/java/lang/String.html#split(java.lang.String, int).
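As a small illustration of that method, here is a minimal sketch of split with a limit. The class name SplitLimitDemo and the helper splitAtMost are made up for this example. Note that the limit only caps the number of pieces; on its own it does not produce overlapping pairs like AB, BC, CD.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Splits the line on single spaces into at most `limit` pieces;
    // the last piece keeps the unsplit remainder of the line.
    static String[] splitAtMost(String line, int limit) {
        return line.split(" ", limit);
    }

    public static void main(String[] args) {
        String[] parts = splitAtMost("A B C D", 2);
        System.out.println(Arrays.toString(parts)); // prints [A, B C D]
    }
}
```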

As for comparing the two text files, I would recommend having your code read both of them, populate two arrays of unique entries, and then run the comparisons between the strings. Hope I helped.
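One likely reason the loop in the question prints words that are in the list: its else branch fires on the first entry of words that does not match, prints x, and breaks, so each x is effectively compared with the first entry only. A set lookup avoids the inner loop entirely. The following is only a sketch under assumptions: the class and method names are invented, and it assumes every line of the frequency list ends in whitespace followed by a numeric count, as in the sample lines.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FrequencyListLookup {
    // Extracts the bigram part of a line such as "mit dem    542824":
    // the trailing whitespace-plus-count is removed, the words are kept.
    static String bigramOf(String line) {
        return line.trim().replaceFirst("\\s+\\d+$", "");
    }

    // Collects the bigrams of all non-empty lines into a set for fast lookup.
    static Set<String> loadBigrams(List<String> lines) {
        Set<String> known = new HashSet<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                known.add(bigramOf(line));
            }
        }
        return known;
    }

    public static void main(String[] args) {
        Set<String> known = loadBigrams(
                Arrays.asList("mit dem    542824", "und die    517126"));
        System.out.println(known.contains("und die")); // prints true
        System.out.println(known.contains("die und")); // prints false
    }
}
```

The lines passed to loadBigrams could come from the same readLine loop the question already uses.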

SomeStudent

I solved the splitting problem. Your proposal would be good if I only wanted to compare, but I also need the frequencies. If I use a HashSet, the values are unique, so I can't do something like hashSet.size() to get the total number of words. Maybe my question wasn't clear. I mean, I want to know what percentage the words from outagain make up inside griechenland. For example, if only 8 bigrams are the same and griechenland has 100 words, the output should be 8%... – lydiaP Jul 15 '15 at 18:09

score 0 · Accepted Answer · edited May 23 '17 at 11:58

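// Requires java.util.ArrayList and java.util.List; place both methods inside a class.

// Returns every run of n consecutive words in str,
// e.g. n = 2 on "A B C D" gives "A B", "B C", "C D".
public static List<String> ngrams(int n, String str) {
    List<String> ngrams = new ArrayList<String>();
    String[] words = str.split(" ");
    for (int i = 0; i < words.length - n + 1; i++)
        ngrams.add(concat(words, i, i+n));
    return ngrams;
}

// Joins words[start] up to (but not including) words[end] with single spaces.
public static String concat(String[] words, int start, int end) {
    StringBuilder sb = new StringBuilder();
    for (int i = start; i < end; i++)
        sb.append((i > start ? " " : "") + words[i]);
    return sb.toString();
}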

It is much easier to use the generic "n-gram" approach, so you can split every 2 or 3 words if you want. I have used this exact code almost any time I need to split words in the (AB), (BC), (CD) format. I took the code from here: NGram Sequence.
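To connect this to the percentage asked about in the comments, here is a rough, self-contained sketch that counts the bigrams of a text and reports what share of them appears in a set of known bigrams. It makes several assumptions: the class and method names are invented, it needs Java 8 or later for Map.merge, it builds the bigrams inline rather than calling ngrams so that it runs on its own, and it takes the total number of bigram occurrences in the text as the denominator, which may or may not be the intended one.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class BigramOverlap {
    // Counts how often each overlapping word pair (A B, B C, C D, ...) occurs.
    static Map<String, Integer> countBigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return counts;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            counts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
        }
        return counts;
    }

    // Percentage of bigram occurrences in the text that also appear in `known`.
    static double percentKnown(Map<String, Integer> counts, Set<String> known) {
        long total = 0;
        long matched = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            total += e.getValue();
            if (known.contains(e.getKey())) {
                matched += e.getValue();
            }
        }
        return total == 0 ? 0.0 : 100.0 * matched / total;
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("mit dem", "und die"));
        Map<String, Integer> counts =
                countBigrams("er kam mit dem Auto und die Frau mit dem Rad");
        // 10 bigrams in total, 3 of them known ("mit dem" twice, "und die" once)
        System.out.println(percentKnown(counts, known)); // prints 30.0
    }
}
```

Keeping the counts in a map preserves the frequencies, while the set is used only for membership checks, so the uniqueness of set entries does not get in the way of the totals.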
