3

I'm making out a couple of ideas to make an algorithm that would find 3 most common words in multiple sentences. What do I mean by that? Let's have a look at the example below, let's say I have 3 sentence like as follows:

1. New Samsung Galaxy S7 Edge SM-G935FD Duos 12MP 4G (FACTORY UNLOCKED) 32GB Phone
2. Samsung Galaxy S7 32GB G930P (GSM Unlocked) 4G LTE 12MP Smartphone Black A
3. New Samsung Galaxy S7 SM-G930FD Duos 5.1'' 12MP (FACTORY UNLOCKED) 32GB Phone

The algorithm determines that the 3 most commons words (all next to eachother) are: "Samsung galaxy S7".

My idea (I believe this is the most simplest one that can be implemented) is to take out the first 3 words from the first sentence and start out like that. So for example:

1st loop I get these 3 combinations of words: New Samsung Galaxy 2nd loop I get these 3 combinations of words (excluding the first word in the sentence): Samsung galaxy S7...

So on goes the process till the first sentence (string) ends.

Now my question to you guys is:

  1. Is this a good way to do like I mentioned above?
  2. Are there Algorithms out there that could do the same thing, but are more efficient when time factor comes in question (ie. they work faster)?

Can someone help me out with this? Thanks ! :)

User987
  • 3,663
  • 15
  • 54
  • 115
  • Do the words have to be right next to each other? – paul Oct 20 '16 at 13:45
  • Yes @paul that's the criteria – User987 Oct 20 '16 at 13:50
  • Do you already have code that does what you described above? – paul Oct 20 '16 at 13:51
  • Not yet, I'm still working out all possible ideas before I start :) – User987 Oct 20 '16 at 13:52
  • From my perspective it looks like you have the best idea but, how you implement it in your code is a large factor for speed and efficiency. As far as the idea for your algorithm I believe it is best depending on the code. – paul Oct 20 '16 at 13:56
  • One ques. How do you differentiate `common` and `most common` in your case? – Saurav Sahu Oct 20 '16 at 13:59
  • @User987: do you want a word which is present in all the `n` given sentences or do you want a want that is present in the most number of sentences??? Please clarify – User_Targaryen Oct 20 '16 at 14:25
  • Hi @User_Targaryen I want the most common sequence of 3 words present in N given sentences. If the "Samsung galaxy S7" is the most common one in all sentences, then this is the result I'm seeking. Do you understand ? :) – User987 Oct 20 '16 at 14:31

2 Answers2

0

No, there is no fastest way because to find the three most common words in the string array you must scan the lines to check the possible match.
But there is an improvment: if the three words are unique in the strings (there is only one Samsung Galaxy S7 per sentence) and you want exit as soon as you have find the first string of most common words you can make the following control:

if(counter == array.length)
   return mostCommonWords

This because if the three words are present in all string of the array you know that the other word groups will have at maximum the same counter. But this control only work if the three words are unique per sentence and you want get the first most common occurence

Tinwor
  • 7,765
  • 6
  • 35
  • 56
  • Okay I get the idea now. Thanks! I'll accept this as an answer because I get the idea now approximately on how to do it ! :) – User987 Oct 20 '16 at 14:05
  • You are welcome, if you want post the code on github I'll give you an hint – Tinwor Oct 20 '16 at 14:06
0

Using hashmap along with arraylist would be appropriate:

HashMap<String,ArrayList<Integer>> map = new HashMap<String,ArrayList<Integer>(NumOfSentences)>();    

where String stores the three word phrase and Arraylist stores the corresponding frequency at each sentence indexes.

Caution: Just storing the count of occurrence won't help as you may at end won't know for sure which all sentences had that phrase.

In your case, map would look something like this:

//...other Entries
{"Samsung Galaxy S7",  {1, 1, 1}}
//...other Entries

You can see it has frequencies corresponding to all the sentence indexes. You need to find the minimum of the arraylist and consider that as overall frequency for that phrase.


How to decide most common - consider you added the phrase twice in each sentence, then map would look like this:

//...other Entries
{"Some-3-word-phrase-present-only-ONCE-in-each-sentence",  {1, 1, 1}}
{"Some-3-word-phrase-present-TWICE-in-each-sentence",  {2, 2, 2}}
//...other Entries

Clearly, the later one would be considered as the answer.

Saurav Sahu
  • 13,038
  • 6
  • 64
  • 79