
I am having a problem with Java programming which involves Lists. Basically, what I am trying to do is count the occurrences of each word in a sentence, from a list containing several sentences. The code that builds the list of sentences is as below:

List<List<String>> sort = new ArrayList<>();
for (String sentence : complete.split("[.?!]\\s*"))
{
    sort.add(Arrays.asList(sentence.split("[ ,;:]+"))); //put each sentences in list
}

The output from the list is as follows:

[hurricane, gilbert, head, dominican, coast]
[hurricane, gilbert, sweep, dominican, republic, sunday, civil, defense, alert, heavily, populate, south, coast, prepare, high, wind]
[storm, approach, southeast, sustain, wind, mph, mph]
[there, alarm, civil, defense, director, a, television, alert, shortly]

The desired output should be as follows (only an example). It should output all the unique words in the list and calculate the occurrences per sentence.

Word: hurricane
Sentence 1: 1 times
Sentence 2: 1 times
Sentence 3: 0 times
Sentence 4: 0 times

Word: gilbert
Sentence 1: 0 times
Sentence 2: 2 times
Sentence 3: 1 times
Sentence 4: 0 times 

Word: head
Sentence 1: 3 times
Sentence 2: 2 times
Sentence 3: 0 times
Sentence 4: 0 times 

and goes on....

With the example above, the word 'hurricane' occurs 1 time in the first sentence, 1 time in the second sentence, none in the third sentence and none in the fourth sentence. How do I achieve this output? I was thinking of a 2D matrix for building them. Any help will be appreciated. Thanks!

Cael
  • could you include what the input is right now? – SomeJavaGuy Jan 25 '16 at 06:50
  • Hi, I've already included the output as above. The one from the list. It looks weird because I've removed the stop word (exp: the, a , an) and stem the words (exp: knew -> know) – Cael Jan 25 '16 at 06:52
  • Is the count correct? because gilbert only appear once in sentence 1 and once in sentence 2. But in your output it says different. – Simone Zandara Jan 25 '16 at 06:58
  • Oops, the count is not accurate. That's just an example, but I've edited it (only for hurricane) – Cael Jan 25 '16 at 06:59

3 Answers


This is a working solution; I did not take care of the printing. The result is a Map from each word to a list holding the count of that word in each sentence, indexed from 0. It runs in O(N) time. Play here: https://repl.it/Bg6D

    List<List<String>> sort = new ArrayList<>();
    Map<String, ArrayList<Integer>> res = new HashMap<>();

    // split by sentence
    for (String sentence : someText.split("[.?!]\\s*")) {
        sort.add(Arrays.asList(sentence.split("[ ,;:]+"))); //put each sentences in list
    }

    // initialize every word in the map with a zero count for each sentence
    final int sentenceCount = sort.size();
    sort.forEach(sentence -> sentence.forEach(
            s -> res.put(s, new ArrayList<>(Collections.nCopies(sentenceCount, 0)))));

    int index = 0;
    // count the occurrences of each word for each sentence.
    for (List<String> sentence: sort) {
        for (String s : sentence) {
            res.get(s).set(index, res.get(s).get(index) + 1);
        }
        index++;
    }
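Since the printing was left out, here is a self-contained sketch of the whole pipeline plus a printing loop in the format the question asks for (the class and method names are just for illustration):

```java
import java.util.*;

public class WordCounts {

    // builds word -> per-sentence occurrence counts, as in the answer above
    static Map<String, ArrayList<Integer>> countPerSentence(String text) {
        List<List<String>> sentences = new ArrayList<>();
        for (String sentence : text.toLowerCase().split("[.?!]\\s*")) {
            sentences.add(Arrays.asList(sentence.split("[ ,;:]+")));
        }

        // initialize every word with a zero count per sentence
        Map<String, ArrayList<Integer>> res = new HashMap<>();
        int sentenceCount = sentences.size();
        for (List<String> sentence : sentences) {
            for (String w : sentence) {
                res.putIfAbsent(w, new ArrayList<>(Collections.nCopies(sentenceCount, 0)));
            }
        }

        // count the occurrences of each word for each sentence
        int index = 0;
        for (List<String> sentence : sentences) {
            for (String w : sentence) {
                res.get(w).set(index, res.get(w).get(index) + 1);
            }
            index++;
        }
        return res;
    }

    public static void main(String[] args) {
        Map<String, ArrayList<Integer>> res =
                countPerSentence("Hurricane Gilbert heads for the coast. The hurricane brings high wind.");
        // print in the format from the question
        for (Map.Entry<String, ArrayList<Integer>> e : res.entrySet()) {
            System.out.println("Word: " + e.getKey());
            for (int i = 0; i < e.getValue().size(); i++) {
                System.out.println("Sentence " + (i + 1) + ": " + e.getValue().get(i) + " times");
            }
            System.out.println();
        }
    }
}
```

Note that a HashMap does not keep the words in any particular order; use a TreeMap or LinkedHashMap if the order of the printed words matters.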

EDIT: In answer to your comment.

  List<Integer> getSentence(int sentence, Map<String, ArrayList<Integer>> map) {
     return map.entrySet().stream().map(e -> e.getValue().get(sentence)).collect(Collectors.toList());
  }

Then you can call

List<Integer> sentence0List = getSentence(0, res);

However, be aware that this approach is not optimal: each call to getSentence scans every entry in the map, so it runs in O(W) time, W being the number of distinct words. For a few calls it is totally fine, but it does not scale. You have to clarify for yourself what you will do with the result. If you need to call getSentence many times, this is not the correct approach; in that case you will need the data structured differently. Something like

Sentences = [
         {'word1': N, 'word2': N},... // sentence 1 
         {'word1': N, 'word2': N},... // sentence 2

]

So you are able to easily access the word count per each sentence.
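That per-sentence layout could be sketched in Java like this, assuming the same List&lt;List&lt;String&gt;&gt; of tokenized sentences as above (the class and method names are just for illustration):

```java
import java.util.*;

public class SentenceMaps {

    // builds one word -> count map per sentence, so the counts for a
    // given sentence can be read directly without scanning every word
    static List<Map<String, Integer>> perSentenceCounts(List<List<String>> sentences) {
        List<Map<String, Integer>> result = new ArrayList<>();
        for (List<String> sentence : sentences) {
            Map<String, Integer> counts = new HashMap<>();
            for (String w : sentence) {
                counts.merge(w, 1, Integer::sum); // add 1, starting from 0
            }
            result.add(counts);
        }
        return result;
    }
}
```

With this layout, reading the counts for sentence i is a single result.get(i) instead of a pass over the whole word map.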

EDIT 2: Call this method:

  Map<String, Float> getFrequency(Map<String, ArrayList<Integer>> stringMap) {
    Map<String, Float> res = new HashMap<>();
    stringMap.entrySet().stream().forEach(e -> res.put(e.getKey()
                , e.getValue().stream().mapToInt(Integer::intValue).sum() / (float)e.getValue().size()));
    return res;
  }

Will return something like this:

{standard=0.25, but=0.25, industry's=0.25, been=0.25, 1500s=0.25, software=0.25, release=0.25, type=0.5, when=0.25, dummy=0.5, Aldus=0.25, only=0.25, passages=0.25, text=0.5, has=0.5, 1960s=0.25, Ipsum=1.0, five=0.25, publishing=0.25, took=0.25, centuries=0.25, including=0.25, in=0.25, like=0.25, containing=0.25, printer=0.25, is=0.25, t
Simone Zandara
  • Thanks for your time. Your solution works, but is there a way I could access all the first values from each word? For example, {to=[0, 1, 0, 0], popularised=[0, 0, 0, 1], simply=[1, 0, 0, 0]} -> I'll get 0, 0, 1 (the first value from each word). – Cael Jan 25 '16 at 08:53
  • It is easily possible; however, it depends on what you actually need. Do you need grouping per word or per sentence? I made the solution considering that you needed a per-word segmentation. Anyway, I made a small edit to get the sentence counts. – Simone Zandara Jan 25 '16 at 09:05
  • Thanks. Btw, one last thing: I need to get the total occurrences for each word, sum them up, and divide by the total number of sentences. Something like this for each word: Exp 1: prepare=[0, 1, 0, 0] -> so 1/4. Exp 2: look=[1,1,0,0] -> so 2/4. These values may be stored somewhere. Appreciate your help on this. – Cael Jan 25 '16 at 09:48

You could solve your problem by first creating an index for each word. You could use a HashMap and just put every distinct word you find in the text into it, so you would have no need to check for double occurrences.

Then you can iterate over the HashMap and check every word against every sentence. You can count occurrences by using the indexOf method of your list: as long as it returns a value greater than -1, you can count up the occurrences in the sentence. This method only returns the first occurrence, so you have to keep searching the remainder of the list (for example with subList) until no further occurrence is found.

Some Pseudocode would be like:

Array sentences = text.split(sentence delimiter)

for each word in text
    put word on hashmap

for each entry in hashmap
   for each sentence
       int count = 0
       while subList(count, sentence.length) indexOf(entry) > -1
          count for entry ++

Note that this is very greedy and not performance-oriented at all. Also note that there are Java NLP libraries out there which may have already solved your problem in a performance-oriented and reusable way.
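A possible Java rendering of that pseudocode (the names are illustrative; it keeps the greedy subList/indexOf idea described above):

```java
import java.util.*;

public class IndexOfCount {

    // counts how often word occurs in sentence by repeatedly calling
    // indexOf on the remaining sublist, as sketched in the pseudocode
    static int countOccurrences(List<String> sentence, String word) {
        int count = 0;
        int from = 0;
        while (from < sentence.size()) {
            int idx = sentence.subList(from, sentence.size()).indexOf(word);
            if (idx < 0) {
                break;
            }
            count++;
            from += idx + 1; // resume the search after this occurrence
        }
        return count;
    }
}
```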

Jankapunkt

First you can segment your sentences and then tokenize them using a text segmenter such as NLTK or the Stanford tokenizer. Splitting the string (containing sentences) around "[.?!]" is not a good idea: what happens to an "etc." or "e.g." that occurs in the middle of a sentence? Splitting a sentence around "[ ,;:]" is also not a good idea, since a sentence can contain plenty of other symbols such as quotation marks, dashes and so on.

After segmentation and tokenization you can split your sentences around space and store them in a List<List<String>>:

List<List<String>> sentenceList = new ArrayList<>();

Then for your index you can create a HashMap<String,List<Integer>>:

HashMap<String, List<Integer>> words = new HashMap<>();

The keys are all the words in all sentences, each mapped to a list of zeros with one entry per sentence. You can update the values as follows:

for (int i = 0; i < sentenceList.size(); i++) {
    for (String w : words.keySet()) {
        if (sentenceList.get(i).contains(w)) {
            List<Integer> tmp = words.get(w);
            tmp.set(i, tmp.get(i) + 1); // tmp.get(i)++ is not valid Java
        }
    }
}

This solution has the time complexity of O(number_of_sentences*number_of_words) which is equivalent to O(n^2). An optimized solution is:

for (int i = 0; i < sentenceList.size(); i++) {
    for (String w : sentenceList.get(i)) {
        List<Integer> tmp = words.get(w);
        tmp.set(i, tmp.get(i) + 1);
    }
}

This has a time complexity of O(number_of_sentences * average_length_of_sentences). Since the average length of a sentence is usually small, this is close to O(n).
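Since the snippets above assume the per-word lists already exist in the map, here is one way to combine initialization and counting in a single pass; computeIfAbsent creates the zero-filled list the first time a word is seen (the class and method names are just for illustration):

```java
import java.util.*;

public class OptimizedCount {

    // word -> per-sentence counts, initializing each word's list lazily
    static Map<String, List<Integer>> buildIndex(List<List<String>> sentenceList) {
        Map<String, List<Integer>> words = new HashMap<>();
        int n = sentenceList.size();
        for (int i = 0; i < n; i++) {
            for (String w : sentenceList.get(i)) {
                List<Integer> tmp = words.computeIfAbsent(
                        w, k -> new ArrayList<>(Collections.nCopies(n, 0)));
                tmp.set(i, tmp.get(i) + 1);
            }
        }
        return words;
    }
}
```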

MAZDAK