32

I have a String that I have to parse for different keywords. For example, I have the String:

"I will come and meet you at the 123woods"

And my keywords are

'123woods'
'woods'

I should report whenever I have a match and where. Multiple occurrences should also be accounted for.

However, for this one, I should get a match only on '123woods', not on 'woods'. This rules out String.contains(). I should also be able to take a list/set of keywords and check for all of them at the same time. In this example, if I have '123woods' and 'come', I should get two occurrences. The method should run reasonably fast on large texts.

My idea is to use StringTokenizer but I am unsure if it will perform well. Any suggestions?

informatik01
Nikola Yovchev

14 Answers

49

The example below is based on your comments. It uses a List of keywords, which will be searched in a given String using word boundaries. It uses StringUtils from Apache Commons Lang to build the regular expression and print the matched groups.

String text = "I will come and meet you at the woods 123woods and all the woods";

List<String> tokens = new ArrayList<String>();
tokens.add("123woods");
tokens.add("woods");

String patternString = "\\b(" + StringUtils.join(tokens, "|") + ")\\b";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(text);

while (matcher.find()) {
    System.out.println(matcher.group(1));
}

If you are looking for more performance, you could have a look at StringSearch: high-performance pattern matching algorithms in Java.
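The comments below raise two follow-ups: `String.join()` or streams can replace `StringUtils` on Java 8+, and tokens containing regex metacharacters need `Pattern.quote()`. A minimal sketch combining both, which also reports where each match starts, might look like this. The class name `KeywordFinder`, the method `findKeywords`, and the `keyword@index` output format are made up for illustration. It assumes every keyword starts and ends with a word character, since `\b` only behaves as expected in that case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class KeywordFinder {

    /** Returns "keyword@index" for every whole-word occurrence of any keyword in text. */
    static List<String> findKeywords(String text, List<String> keywords) {
        // Quote each keyword so regex metacharacters in it are matched literally.
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b");

        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            // group() is the keyword that matched, start() is its index in text.
            hits.add(matcher.group() + "@" + matcher.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "I will come and meet you at the woods 123woods and all the woods";
        // prints [come@7, woods@32, 123woods@38, woods@59]
        System.out.println(findKeywords(text, Arrays.asList("123woods", "woods", "come")));
    }
}
```

Because `matcher.group()` returns the matched text itself, there is no need to look the keyword up in the list again after a match.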

Chris
  • What if I have an ArrayList and I want to use a Pattern to build it? Seems like I have to use the trusty old StringBuilder? – Nikola Yovchev Feb 23 '11 at 13:06
  • @baba - You could do that, or you could iterate through the List<>. I'm not sure which would be more efficient, you may want to try both approaches if performance is a concern. – user Feb 23 '11 at 13:12
  • Personally I would prefer to iterate through the list. Added this option to my answer. – Chris Feb 23 '11 at 13:30
  • I meant what if I have an ArrayList of keywords to search for in the String. For example, my ArrayList will consist of woods and 123woods and some other words. I would have to use a StringBuilder while iterating it in order to construct the Pattern. Then, when the Pattern finds a match, I would have to look up my ArrayList in order to see which one of my keywords was matched. Also, substring is well known for its bad performance and it will create a new String object in a loop. To me, it seems like there should be a much better solution to the problem, but I can't figure out what. – Nikola Yovchev Feb 23 '11 at 14:45
  • @baba: Now I begin to see. I updated my answer based on your comment. – Chris Feb 23 '11 at 15:22
  • what if there is a special character in the token? – njzk2 Jun 06 '14 at 14:32
  • With Java 8, no need for `StringUtils` any more. `String` has a static `join()` method that can do the job. – Ahmad Shahwan Feb 15 '19 at 14:44
  • @njzk2 That could be a problem. To avoid it, you would need to call `Pattern.quote()` on every token – reallynice Mar 05 '20 at 08:58
20

Use regex + word boundaries as others answered.

"I will come and meet you at the 123woods".matches(".*\\b123woods\\b.*");

will be true.

"I will come and meet you at the 123woods".matches(".*\\bwoods\\b.*");

will be false.

morja
12

Hope this works for you:

String string = "I will come and meet you at the 123woods";
String keyword = "123woods";

// Splitting on a single space only works when words are separated by exactly one space.
boolean found = Arrays.asList(string.split(" ")).contains(keyword);
if (found) {
    System.out.println("Keyword matched the string");
}

http://codigounico.blogspot.com/

9

How about something like Arrays.asList(String.split(" ")).contains("xx")?

See String.split() and How can I test if an array contains a certain value.
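As a rough sketch of how the split-based idea could be extended to several keywords at once, the text can be split on runs of non-word characters and each token checked against a `HashSet`. The class name `SplitKeywordCounter` and the method `countKeywords` are made up for illustration, and splitting on `\W+` is an assumption about what counts as a word separator; it would break keywords that themselves contain punctuation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SplitKeywordCounter {

    /** Counts how often each keyword appears as a whole token in text. */
    static Map<String, Integer> countKeywords(String text, List<String> keywords) {
        Set<String> wanted = new HashSet<>(keywords);
        // LinkedHashMap keeps keywords in the order they are first seen in the text.
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Split on runs of non-word characters so punctuation does not stick to tokens.
        for (String token : text.split("\\W+")) {
            if (wanted.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "I will come and meet you at the woods, 123woods and all the woods.";
        // prints {come=1, woods=2, 123woods=1}
        System.out.println(countKeywords(text, Arrays.asList("123woods", "woods", "come")));
    }
}
```

This gives counts but not positions; if the index of each occurrence is needed, a `Matcher`-based approach is easier.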

Community
user
4

Here is a way to match an exact word in a String on Android:

String full = "Hello World. How are you ?";

String one = "Hell";
String two = "Hello";
String three = "are";
String four = "ar";


boolean is1 = isContainExactWord(full, one);
boolean is2 = isContainExactWord(full, two);
boolean is3 = isContainExactWord(full, three);
boolean is4 = isContainExactWord(full, four);

Log.i("Contains Result", is1+"-"+is2+"-"+is3+"-"+is4);

Result: false-true-true-false

Function to match the word:

private boolean isContainExactWord(String fullString, String partWord){
    // Pattern.quote() makes regex metacharacters in partWord match literally.
    String pattern = "\\b" + Pattern.quote(partWord) + "\\b";
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(fullString);
    return m.find();
}

Done

Hiren Patel
3

Try to match using regular expressions. Match for "\b123woods\b"; \b is a word boundary.

Axel
3
public class FindTextInLine {
    String match = "123woods";
    String text = "I will come and meet you at the 123woods";

    public void findText () {
        if (text.contains(match)) {
            System.out.println("Keyword matched the string" );
        }
    }
}
pushkin
Lina
  • While this code snippet may solve the question, [including an explanation](http://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – awh112 Jun 21 '18 at 13:43
2

A solution was accepted long ago, but it can be improved, so for anyone with a similar problem:

This is a classic application for multi-pattern search algorithms.

Java pattern search (with Matcher.find) is not well suited to this. Searching for exactly one keyword is optimized in Java, but searching for an or-expression uses the regex nondeterministic automaton, which backtracks on mismatches. In the worst case, each character of the text is processed l times (where l is the sum of the pattern lengths).

Single-pattern search is better, but not well suited either. One has to restart the whole search for every keyword pattern. In the worst case, each character of the text is processed p times, where p is the number of patterns.

Multi-pattern search processes each character of the text exactly once. Algorithms suitable for such a search include Aho-Corasick, Wu-Manber, and Set Backwards Oracle Matching. These can be found in libraries like Stringsearchalgorithms or byteseek.

// example with StringSearchAlgorithms

AhoCorasick stringSearch = new AhoCorasick(asList("123woods", "woods"));

CharProvider text = new StringCharProvider("I will come and meet you at the woods 123woods and all the woods", 0);

StringFinder finder = stringSearch.createFinder(text);

List<StringMatch> all = finder.findAll();
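For readers who want to see the idea without a library, here is a minimal, self-contained sketch of Aho-Corasick. It is not the implementation used by StringSearchAlgorithms; the class name `WholeWordAhoCorasick`, the method `findWholeWords`, and the `keyword@index` output format are made up for illustration. Since the algorithm itself finds substrings, the sketch adds a whole-word filter, assuming that a word boundary is any position next to a character that is not a letter or digit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class WholeWordAhoCorasick {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                       // longest proper suffix that is also a trie path
        final List<String> outputs = new ArrayList<>();  // keywords ending at this node
    }

    private final Node root = new Node();

    public WholeWordAhoCorasick(List<String> keywords) {
        // 1. Build the trie of all keywords.
        for (String keyword : keywords) {
            Node node = root;
            for (char c : keyword.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(keyword);
        }
        // 2. Breadth-first pass: set failure links and inherit outputs from them.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Returns "keyword@index" for each occurrence that is delimited by word boundaries. */
    public List<String> findWholeWords(String text) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Follow failure links until some state accepts c (or we are back at the root).
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            for (String keyword : node.outputs) {
                int start = i - keyword.length() + 1;
                if (isBoundary(text, start - 1) && isBoundary(text, i + 1)) {
                    hits.add(keyword + "@" + start);
                }
            }
        }
        return hits;
    }

    // Assumption: text edges and anything that is not a letter or digit delimit a word.
    private static boolean isBoundary(String text, int index) {
        return index < 0 || index >= text.length()
                || !Character.isLetterOrDigit(text.charAt(index));
    }

    public static void main(String[] args) {
        WholeWordAhoCorasick search = new WholeWordAhoCorasick(Arrays.asList("123woods", "woods"));
        // prints [woods@32, 123woods@38, woods@59]
        System.out.println(search.findWholeWords(
                "I will come and meet you at the woods 123woods and all the woods"));
    }
}
```

The "woods" inside "123woods" is still found by the automaton, but the boundary check rejects it because the preceding character is a digit.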
CoronA
1

A much simpler way to do this is to use split():

String match = "123woods";
String text = "I will come and meet you at the 123woods";

// String.split() needs a regex argument; "\\s+" splits on runs of whitespace.
String[] sentence = text.split("\\s+");
for(String word: sentence)
{
    if(word.equals(match))
        return true;
}
return false;

This is a simpler, less elegant way to do the same thing without using tokens, etc.

ulu5
  • While simpler to understand and write, it is not the answer of the question I was asking. I have two or three, or maybe indefinite number of "match" keywords, I need to get those that were found in the "text". Of course, you might loop my "match" keywords for each of the "words" on the split text, but I find it far less elegant than the already accepted solution. – Nikola Yovchev Oct 11 '12 at 07:55
0

To match "123woods" instead of "woods", use atomic grouping in the regular expression. Note that, in a string where only "123woods" should match, it matches the first "123woods" and then stops instead of searching the rest of the string.

\b(?>123woods|woods)\b

It tries 123woods first; once that matches, the search stops.

tckmn
0

Looking back at the original question, we need to find some given keywords in a given sentence, count the number of occurrences and know something about where. I don't quite understand what "where" means (is it an index in the sentence?), so I'll pass on that one... I'm still learning Java, one step at a time, so I'll see to that one in due time :-)

Note that common sentences (like the one in the original question) can have repeated keywords, so the search cannot just ask whether a given keyword "exists or not" and count it as 1 if it does. There can be more than one of the same. For example:

// Base sentence (added punctuation, to make it more interesting):
String sentence = "Say that 123 of us will come by and meet you, "
                + "say, at the woods of 123woods.";

// Split it (punctuation taken in consideration, as well):
java.util.List<String> strings = 
                       java.util.Arrays.asList(sentence.split(" |,|\\."));

// My keywords:
java.util.ArrayList<String> keywords = new java.util.ArrayList<>();
keywords.add("123woods");
keywords.add("come");
keywords.add("you");
keywords.add("say");

By looking at it, the expected result would be 5 for "Say" + "come" + "you" + "say" + "123woods", counting "say" twice if we go lowercase. If we don't, then the count should be 4, "Say" being excluded and "say" included. Fine. My suggestion is:

// Set... ready...?
int counter = 0;

// Go!
for(String s : strings)
{
    // Asking if each word of the sentence exists in the keywords, not the
    // other way around, to find repeated keywords in the sentence.
    boolean found = keywords.contains(s.toLowerCase());
    if(found)
    {
        counter ++;
        System.out.println("Found: " + s);
    }
}

// Statistics:
if (counter > 0)
{
    System.out.println("In sentence: " + sentence + "\n"
                     + "Count: " + counter);
}

And the results are:

Found: Say
Found: come
Found: you
Found: say
Found: 123woods
In sentence: Say that 123 of us will come by and meet you, say, at the woods of 123woods.
Count: 5

0

You can use regular expressions. Use Matcher and Pattern methods to get the desired output

Deepak
0

You can also use regex matching with the \b anchor (word boundary).

Rune Aamodt
0

If you want to identify a whole word in a string and change that word, you can do it this way. The final string stays the same except for the word you changed. In this case, "not" becomes "'not'" in the final string.

    StringBuilder sb = new StringBuilder();
    String[] splited = value.split("\\s+");
    if(ArrayUtils.isNotEmpty(splited)) {
        for(String valor : splited) {
            // Put a single space between words, but not before the first one.
            if(sb.length() > 0) {
                sb.append(" ");
            }
            if("not".equalsIgnoreCase(valor)) {
                sb.append("'").append(valor).append("'");
            } else {
                sb.append(valor);
            }
        }
    }
    return sb.toString();
Manuel Franqueira