I have two texts:

First one: My favorite programming language is c++.

Second one: My favorite programming language is c.

I want to search for c and c++ in those texts separately.

To find c I can write \bc\b, but then the result for the first text is wrong (it matches the c in c++), while the second one is fine. I also tried \bc^\+\b, but it doesn't work. To find c++ I tried, for example, \bc\+\+\b, but then neither the first nor the second text matches. Please help.

EDIT:

And what if the text is "I programme in c++ a lot!"?

EDIT:

Here is the unit test that I need to make pass:

package adhoc;

import java.util.HashSet;
import java.util.Set;

import org.junit.Test;

import junit.framework.TestCase;

public class FinderProgrammingTechnologyInTextTest extends TestCase{

    @Test
    public void testFind() {
        // Given:
        Set<String> setOfProgrammingLanguagesToSeek = new HashSet<>();
        setOfProgrammingLanguagesToSeek.add("java");
        setOfProgrammingLanguagesToSeek.add("perl");
        setOfProgrammingLanguagesToSeek.add("c");
        setOfProgrammingLanguagesToSeek.add("c++");

        // When:
        FinderProgrammingTechnologyInText finder = new FinderProgrammingTechnologyInText(
                setOfProgrammingLanguagesToSeek);
        Set<String> result = finder.find("java , perl! c++ and other staff");

        // Then:
        assertTrue(result.contains("java"));
        assertTrue(result.contains("perl"));
        assertFalse(result.contains("c"));
        assertTrue(result.contains("c++"));
    }

}

by changing ONLY the argument of the compile() method:

package adhoc;

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class FinderProgrammingTechnologyInText {

    Set<String> setOfTechnologiesToSearch;

    public FinderProgrammingTechnologyInText(Set<String> x) {
        this.setOfTechnologiesToSearch = x;
    }

    public Set<String> find(String text) {
        Set<String> result = new HashSet<>();
        return setOfTechnologiesToSearch.stream()
                .filter(x -> Pattern
                        .compile(x)  // change only this line
                        .matcher(text).find()
                        ) 
                .collect(Collectors.toSet());       
    }
}
W W
  • Couldn't you just look for the last word in the sentence? – wp78de Oct 06 '17 at 19:25
  • Use `(?<!\w)c\+\+(?!\w)`, `String p = "(?<!\\w)c\\+{2}(?!\\w)";` – Wiktor Stribiżew Oct 06 '17 at 19:38
  • It fails only when searching for `c` – W W Oct 06 '17 at 20:00
  • My solution works, but as far as I can see, it might do more than you need, as it matches `c++` as a whole word. Since you pass a literal string to a `Pattern`, you must `Pattern.quote()` it. So, `.compile(x)` must be replaced with `.compile(Pattern.quote(x))`. And to match a whole word, it must be `.compile("(?<!\\w)" + Pattern.quote(x) + "(?!\\w)")` – Wiktor Stribiżew Oct 06 '17 at 20:09
  • If you need to match a whole word that may start/end with special chars, yes, `.compile("(?<!\\w)" + Pattern.quote(x) + "(?!\\w)")` – Wiktor Stribiżew Oct 06 '17 at 20:13
  • then the third assertion fails. Look carefully, there is `assertFalse` – W W Oct 06 '17 at 20:14
  • Ok, let's assume your word boundaries are non-word and non-symbol chars. Then use `.compile("(?<![\\w\\p{S}])" + Pattern.quote(x) + "(?![\\w\\p{S}])")` – Wiktor Stribiżew Oct 06 '17 at 20:18
  • Thanks, now it works, you are a good programmer. I need to analyse it now. – W W Oct 06 '17 at 20:21

2 Answers

Replace .compile(x) line with

.compile("(?<![\\w\\p{S}])" + Pattern.quote(x) + "(?![\\w\\p{S}])")

Here, (?<![\w\p{S}]) is a negative lookbehind that makes sure there is no word or symbol char immediately to the left of the current location, and (?![\w\p{S}]) is a negative lookahead that makes sure there is no word or symbol char immediately to the right of the current location (that is, word and symbol chars are your allowed "word" chars now).

See a sample regex demo for a c++ keyword at regex101.com.

Since the search words are passed as literal char sequences to Pattern, they must be escaped, and that is what Pattern.quote(x) is doing in the code.
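As a minimal sketch of how the replaced line behaves, the standalone class below builds the pattern the same way and runs it against the sentence from the question's unit test. The class name BoundaryDemo and the helpers wholeTerm and find are made up for this illustration; only the pattern string comes from the answer.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original code.
class BoundaryDemo {

    // Wraps a literal search term in the two negative lookarounds from the answer.
    static Pattern wholeTerm(String term) {
        return Pattern.compile("(?<![\\w\\p{S}])" + Pattern.quote(term) + "(?![\\w\\p{S}])");
    }

    // Returns the terms that occur in the text as standalone tokens,
    // in the order in which the terms were given.
    static Set<String> find(List<String> terms, String text) {
        Set<String> found = new LinkedHashSet<>();
        for (String term : terms) {
            if (wholeTerm(term).matcher(text).find()) {
                found.add(term);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("java", "perl", "c", "c++");
        // "c" is rejected because the "+" after it is a symbol char (\p{S}).
        System.out.println(find(terms, "java , perl! c++ and other staff")); // [java, perl, c++]
        // "." is punctuation, not a symbol char, so "c" is accepted here.
        System.out.println(find(terms, "My favorite programming language is c.")); // [c]
    }
}
```

Without Pattern.quote, the term c++ would be read as a regex (a c followed by a possessive quantifier) rather than as the literal text c++.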

Wiktor Stribiżew

You could just look for the last word in the sentence before the dot.

[\w+]+(?=\.$)

https://regex101.com/r/aPYDTE/1

The problem with your pattern is that the plus sign is not a word character, so the word boundary \b after it does not match. If you used the dot as an anchor instead, you would get a match: \b(c\+\+)\.
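The behaviour of \b around the plus sign can be checked with a short Java sketch against the first text from the question. The class name WordBoundaryDemo and the helper matches are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original code.
class WordBoundaryDemo {

    // True if the regex matches anywhere in the text.
    static boolean matches(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String first = "My favorite programming language is c++.";

        // \b needs a word char on exactly one side; "+" and "." are both non-word chars.
        System.out.println(matches("\\bc\\+\\+\\b", first)); // false
        // Anchoring on the dot instead of a trailing \b succeeds.
        System.out.println(matches("\\bc\\+\\+\\.", first)); // true
        // \bc\b sees a boundary between "c" and "+", so it matches inside c++.
        System.out.println(matches("\\bc\\b", first));       // true
    }
}
```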

If you just want to match c/c++ and other languages, try \W(c\+\+|css|c|java)\W
Here I have added a non-word \W as the boundary on each side. Using lookarounds instead lets you use the full match rather than the capturing group $1.

(?<=\W)(c\+\+|css|c|java)(?=[^\w\+])

https://regex101.com/r/qWnOsB/4
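A rough Java sketch of using the lookaround version to collect every match follows. The class name AlternationDemo and the method findAll are invented for this example, and the sample sentences are only test input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original code.
class AlternationDemo {

    // The lookaround pattern from above, with backslashes doubled for a Java string.
    private static final Pattern LANGUAGES =
            Pattern.compile("(?<=\\W)(c\\+\\+|css|c|java)(?=[^\\w\\+])");

    // Collects every full match in order of appearance.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = LANGUAGES.matcher(text);
        while (m.find()) {
            hits.add(m.group()); // full match; no need for group(1)
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("I like c, c++ and java a lot.")); // [c, c++, java]
        // "java" at the very start is skipped: (?<=\W) needs a character before it.
        System.out.println(findAll("java , perl! c++ and other staff")); // [c++]
    }
}
```

Note that both lookarounds here are positive, so they require an actual character on each side. A language name at the very start or end of the text is therefore not matched, which is one reason the negative lookarounds may suit the unit test in the question better.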

wp78de