I know this question has already been asked several times, but I can't work out how to apply the answers to my code. My goal is the following: I have two files, griechenland_test.txt and outagain5.txt. I want to read both and then find out what percentage of outagain5.txt is contained in the other file.
outagain5.txt contains lines like this:
mit dem    542824
und die    517126
griechenland_test.txt is a normal Wikipedia article about that topic (plain text, without frequency counts).
1. Problem - How can I split the input into bigrams? I mean every pair of adjacent words, each pair overlapping with the previous one. So for the words A, B, C, D I want to get AB, BC, CD. This is what I have so far:
while ((sCurrentLine = in.readLine()) != null) {
    // System.out.println(sCurrentLine);
    arr = sCurrentLine.split(" ");
    // Count every single word of the line (no bigrams yet)
    for (int i = 0; i < arr.length; i++) {
        if (null == hash.get(arr[i])) {
            hash.put(arr[i], 1);
        } else {
            int x = hash.get(arr[i]) + 1;
            hash.put(arr[i], x);
        }
    }
}
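One possible way to get overlapping pairs is to stop the loop one word early and join each word with its successor. The following is only a sketch: the class name `BigramCounter` and the method `countBigrams` are made up for illustration, and it assumes that words are separated by whitespace and that a bigram never spans two lines.

```java
import java.util.HashMap;
import java.util.Map;

class BigramCounter {
    // Adds the overlapping word pairs of one line to counts:
    // "A B C D" contributes "A B", "B C" and "C D".
    static Map<String, Integer> countBigrams(String line, Map<String, Integer> counts) {
        String[] arr = line.trim().split("\\s+");
        // Stop one word before the end so that arr[i + 1] always exists
        for (int i = 0; i + 1 < arr.length; i++) {
            String bigram = arr[i] + " " + arr[i + 1];
            // merge() inserts 1 for a new key, otherwise adds 1 to the old count
            counts.merge(bigram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> hash = new HashMap<>();
        countBigrams("A B C D", hash);
        System.out.println(hash);
    }
}
```

Inside the existing `while` loop this would replace the inner `for` loop with a call such as `countBigrams(sCurrentLine, hash)`. The keys are joined with a single space, so the entries read from outagain5.txt have to be normalized the same way before they are compared.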
Then I read the other file with the following code. I only add the words, not the number (I split on 4 spaces, so the two words are in h[0]):
for (String line = br.readLine(); line != null; line = br.readLine()) {
    String[] h = line.split("    ");
    words.add(h[0]);
}
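Splitting on an exact number of spaces is fragile if the file happens to use tabs or a different amount of whitespace. A hedged alternative is to split on any run of whitespace and rebuild the bigram from the first two tokens. The names `FrequencyLineParser`, `bigramOf` and `readBigrams` are hypothetical, and the sketch assumes every useful line has the form "word word count".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

class FrequencyLineParser {
    // Returns the two leading words joined by one space,
    // or null if the line does not have at least "word word count".
    static String bigramOf(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 3) {
            return null;
        }
        return tokens[0] + " " + tokens[1];
    }

    // Collects the bigrams of all well-formed lines into a set
    static Set<String> readBigrams(BufferedReader br) throws IOException {
        Set<String> words = new HashSet<>();
        for (String line = br.readLine(); line != null; line = br.readLine()) {
            String bigram = bigramOf(line);
            if (bigram != null) {
                words.add(bigram);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        String sample = "mit dem    542824\nund die    517126\n";
        Set<String> words = readBigrams(new BufferedReader(new StringReader(sample)));
        System.out.println(words.contains("mit dem")); // prints true
    }
}
```

A `HashSet` is used instead of an `ArrayList` because the later comparison only needs to know whether an entry exists, and `contains` on a set does not have to walk the whole collection.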
2. Problem - Now I compare each String x in hash with each String s in words. I added the System.out.println in the else branch to see which words are not contained in outagain5.txt, but several of the printed words ARE contained in outagain5.txt, and I don't understand why. So I think the comparison doesn't work properly, or maybe this will be solved once the first problem is fixed.
ArrayList<String> words = new ArrayList<String>();
ArrayList<String> neuS = new ArrayList<String>();
ArrayList<Long> neuZ = new ArrayList<Long>();
for (String x : hash.keySet()) {
    summe = summe + hash.get(x);
    long neu = hash.get(x);
    for (String s : words) {
        if (x.equals(s)) {
            neuS.add(x);
            neuZ.add(neu);
            disc = disc + 1;
        } else {
            System.out.println(x);
            break;
        }
    }
}
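As far as I can tell from the loop above, the `else` branch runs as soon as the first element of words differs from x, prints x and then leaves the inner loop. That means each x is only ever compared with the first entries of words, which would explain why words that do occur in outagain5.txt are printed. A membership test avoids the inner loop entirely. The sketch below uses the made-up names `CoverageCheck`, `percentFound` and `missingFromList`, and it assumes that "percentage" means the share of entries from outagain5.txt that also occur in the article; if the other direction is meant, the denominator changes accordingly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CoverageCheck {
    // Percentage of list entries (words) that occur as keys in the article counts (hash)
    static double percentFound(Set<String> words, Map<String, Integer> hash) {
        if (words.isEmpty()) {
            return 0.0;
        }
        int disc = 0;
        for (String s : words) {
            if (hash.containsKey(s)) {
                disc++;
            }
        }
        return 100.0 * disc / words.size();
    }

    // Article entries that do not appear in the list at all
    static List<String> missingFromList(Map<String, Integer> hash, Set<String> words) {
        List<String> missing = new ArrayList<>();
        for (String x : hash.keySet()) {
            if (!words.contains(x)) {
                missing.add(x);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, Integer> hash = new HashMap<>();
        hash.put("mit dem", 3);
        hash.put("in der", 1);
        Set<String> words = new HashSet<>();
        words.add("mit dem");
        words.add("und die");
        System.out.println(percentFound(words, hash));    // prints 50.0
        System.out.println(missingFromList(hash, words)); // prints [in der]
    }
}
```

An entry is reported as missing only after the whole set has been checked, which is the decision the original `else` branch made too early.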
I hope my question is clear. Thanks a lot!