0

I have a large text such as:

 really!!!  Oh Oh! You read about them in a book and they told you to wear       clothes? buahahahaham Did they also tell you how they were able to sew the leaves that they used to cover up? You amu

I also have an ArrayList of words and expressions such as really or oh oh!. I want to count how many times the phrases in the ArrayList occur in the text above, or in any similar text. To do that, I first split the text into words and loop over them as follows:

String[] word = content.split("\\s+");
for (int j = 0; j < word.length; j++) {
    if (sexuality.contains(word[j])) {
        swCount++;
    }
}

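But this does not work: after splitting on whitespace, a multi-word phrase like oh oh! is never a single element, and really appears in the text as really!!!, so neither is found by the method above. Can anyone help?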

Alessio
HMdeveloper

4 Answers

2

This counts the occurrences of any searchString in your input.

String input = "....";
List<String> searchStrings = Arrays.asList("oh oh!", "really");

int count = 0;
for (String searchString : searchStrings) {
    int indexOf = input.indexOf(searchString);
    while (indexOf > -1) {
        count++;
        indexOf = input.indexOf(searchString, indexOf+1);
    }
}

If you want a case-insensitive search, convert both the input and the search strings to lowercase. If you only want to count each search string once, no matter how often it occurs, replace the indexOf call and the while loop with a simple contains:

int count = 0;
for (String searchString : searchStrings) {
    if (input.contains(searchString)) {
        count++;
    }
}

If you have something like god in your blacklist and don't want to match goddamn in the input (for whatever reason), you need to make sure there are word boundaries around your search word. Have a look at this code:

int count = 0;
for (String searchString : searchStrings) {
    Pattern pattern = Pattern.compile("\\b" + Pattern.quote(searchString) + "\\b");
    Matcher matcher = pattern.matcher(input);
    if (matcher.find()) {
        count++;
    }
}
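
One caveat with \b: it only matches between a word character and a non-word character, so a phrase that ends in punctuation, such as oh oh!, will not match when it is followed by a space. A minimal sketch of a variant that uses lookarounds instead, ignores case, and counts every occurrence is shown here. The class name PhraseCounter and the method name countPhrases are made up for illustration, and \w is assumed to be a good enough definition of a "word character" for your text:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseCounter {
    // Counts every occurrence of every phrase, ignoring case, but only where
    // the phrase is not directly preceded or followed by a letter, digit or
    // underscore. This keeps "ass" from matching inside "passenger" while
    // still allowing phrases that end in punctuation, such as "oh oh!".
    static int countPhrases(String input, List<String> phrases) {
        int count = 0;
        for (String phrase : phrases) {
            Pattern pattern = Pattern.compile(
                    "(?<!\\w)" + Pattern.quote(phrase) + "(?!\\w)",
                    Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(input);
            while (matcher.find()) {
                count++;
            }
        }
        return count;
    }
}
```

Calling PhraseCounter.countPhrases(input, searchStrings) with the searchStrings list from above would count both really and Oh Oh! in the example text. Note that a phrase containing a single space will not match if the text has several spaces between the words; normalize the whitespace first if that matters.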
steffen
  • Thank you for your answer, but for example in my ArrayList I have god damn and god!!! but not god. However, when I run your code, gode would be counted as well, which it should not be. Is there any way to solve this? – HMdeveloper Dec 14 '15 at 23:26
  • @HamedMinaee Having `god damn` and `god!!!` would not match `god` or `gode` in your input since there are some characters missing. – steffen Dec 14 '15 at 23:35
  • Thank you, yes, I double-checked and you are right. But in the case of ass and passenger, when we have ass in the ArrayList it matches passenger as well, which it should not. Is there any way to avoid it? – HMdeveloper Dec 15 '15 at 00:02
0

I also don't understand exactly: is the problem that "oh oh!" should be treated as one word, or is the "!" the problem? Either way, consider customizing the equality check that your ArrayList's contains relies on (I assume "sexuality" is your ArrayList) to fit your needs. Check out this post: ArrayList's custom Contains method

geri
0

The brute-force approach is to insert all strings of the sexuality list into a HashMap and then look up each substring of content in the map. You can limit the length of the substrings to the length of the longest entry in the sexuality list. However, this could be really expensive; the cost depends on the length of content and on the length of the longest entry in sexuality.

For a smarter approach, you should have a look at another data structure, the trie. An implementation is available in the Apache Commons Collections 4 library. This approach is much faster because it lets you stop scanning a substring as soon as you reach a prefix that doesn't exist in your dictionary (in your case, the sexuality list).
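
To illustrate the idea without pulling in a library, here is a minimal hand-rolled sketch of a character trie. It is not the Apache Commons implementation; the class name PhraseTrie and the method name countOccurrences are made up for this example. It matches exact characters (no case folding, no word-boundary check) and counts overlapping matches too:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PhraseTrie {
    // One node per character; endOfPhrase marks where a dictionary entry ends.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfPhrase;
    }

    private final Node root = new Node();

    // Builds the trie from the dictionary (for example, the sexuality list).
    PhraseTrie(List<String> phrases) {
        for (String phrase : phrases) {
            Node node = root;
            for (char c : phrase.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.endOfPhrase = true;
        }
    }

    // From every start position, walks the trie as far as the text allows and
    // stops as soon as the current prefix is not in the dictionary.
    int countOccurrences(String text) {
        int count = 0;
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break; // no dictionary entry starts with this prefix
                }
                if (node.endOfPhrase) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

You would build it once with new PhraseTrie(sexuality) and then call countOccurrences(content) for each text. Lowercasing both the dictionary and the text beforehand gives a case-insensitive count.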

ugo
0

If your "sentence" is not too big and your List doesn't contain too many items, I would go the easy way and do it like this:

String sentence = "Here is my my sentence";
List<String> searchList = new ArrayList<>();
searchList.add("is");
searchList.add("my");
int occurences[] = new int[searchList.size()];
for (int i = 0; i < searchList.size(); i++) {
    int searchFromPos = 0;
    String wordToSearch = searchList.get(i);
    while ((searchFromPos = sentence.indexOf(wordToSearch, searchFromPos)) != -1) {
        occurences[i]++;
        searchFromPos += wordToSearch.length();
    }
}

NOTE, however, that this will also detect word parts. For example, when your sentence is "This is sneaky" and you search for "is", there will be two results, because "This" also contains "is".

Lukas Makor