I'm trying to get all occurrences of a bigram out of a string.

The code below finds only some of them.

String testString = "Lorem ipsum dolor sit amet.";

Pattern pat = Pattern.compile("\\w+ \\w+");
Matcher mat = pat.matcher(testString);

while (mat.find()) {
    System.out.println("Match: " + mat.group());
}

What I got was:

Match: Lorem ipsum

Match: dolor sit

Whereas the result I want is:

Match: Lorem ipsum

Match: ipsum dolor

Match: dolor sit

Match: sit amet

Cœur
Calahan

2 Answers

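Match single words instead of word pairs. find() resumes after the end of the previous match, so matches can never overlap: once "Lorem ipsum" has been consumed, "ipsum" cannot start another match. Instead, remember the previous word, and each time a new word is found, store the pair of previous and current word.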

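import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String testString = "Lorem ipsum dolor sit amet.";

Pattern pattern = Pattern.compile("\\w+");
Matcher matcher = pattern.matcher(testString);
String lastSingleWord = null; // previous word, null until the first match
List<String> results = new ArrayList<>();

while (matcher.find()) {
    String singleWord = matcher.group();
    if (lastSingleWord != null) {
        // pair the previous word with the current one
        results.add(lastSingleWord + " " + singleWord);
    }
    lastSingleWord = singleWord;
}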

Afterwards, if you want, you can output the list, or do with it as you please.

results.forEach(System.out::println);
// Lorem ipsum
// ipsum dolor
// dolor sit
// sit amet
TreffnonX

Try this pattern: (?<= |^)(?=([^ ]+ [^ ]+)). The match itself is empty; each bigram is captured in group 1.

Explanation:

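(?<= |^) - positive lookbehind: asserts that what precedes is a space or the beginning of the string (^)

(?=([^ ]+ [^ ]+)) - positive lookahead: asserts that what follows is [^ ]+ (one or more characters other than a space), then a space, then again one or more characters other than a space. Because a lookahead consumes no characters, the next match can start inside the previous bigram.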

As suggested in comments, this pattern could be slightly simplified: (?=\b([^ ]+ [^ ]+))

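For reference, here is a minimal sketch of how the lookahead approach might be used from Java. It is not part of the original answer: the class name BigramLookahead and the method bigrams are made up for illustration, and the pattern swaps [^ ]+ for \w+, because with [^ ]+ the last bigram of the sample string comes out as "sit amet." with the trailing period. It also assumes words are separated by exactly one space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
public class BigramLookahead {
    // Zero-width lookahead: the overall match is empty, the bigram sits in group 1.
    // \b keeps a match from starting in the middle of a word.
    // Assumption: \w+ replaces [^ ]+ from the answer so punctuation is excluded.
    private static final Pattern BIGRAM = Pattern.compile("(?=\\b(\\w+ \\w+))");

    static List<String> bigrams(String text) {
        List<String> results = new ArrayList<>();
        Matcher matcher = BIGRAM.matcher(text);
        // After an empty match, find() moves forward by one character,
        // so overlapping bigrams are picked up one after another.
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        bigrams("Lorem ipsum dolor sit amet.").forEach(System.out::println);
    }
}
```

Note that mat.group() would return an empty string here; the bigram has to be read with group(1).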

Michał Turczyn