
I have a multiline string that is delimited by a set of different delimiters:

A Z DelimiterB B X DelimiterA (C DelimiterA D) DelimiterB (E DelimiterA F) DelimiterB G DelimiterA H

I need to split that string by the delimiters, but if some words are inside brackets, the whole bracketed group should be extracted as a single token, even if it contains a delimiter. I need the tokens to be extracted as follows:

A Z
DelimiterB
B X
DelimiterA
(C DelimiterA D) (extract with brackets)
DelimiterB
(E DelimiterA F)
DelimiterB
G
DelimiterA
H

Currently I am using this expression to split by the delimiters:

(((?<=DelimiterA)|(?=DelimiterA))|((?<=DelimiterB)|(?=DelimiterB)))

I tried the following, but it is not working. How can I make this work?

((?=\()|(?<=\))|(((?<=DelimiterA)|(?=DelimiterA))|((?<=DelimiterB)|(?=DelimiterB))))

Java code:

String txt = "A DelimiterB B DelimiterA (C DelimiterA D) DelimiterB (E DelimiterA F) DelimiterB G DelimiterA H";
// Backslashes must be doubled inside a Java string literal ("\(" alone does not compile).
String[] texts = txt.split("((?=\\()|(?<=\\))|(((?<=DelimiterA)|(?=DelimiterA))|((?<=DelimiterB)|(?=DelimiterB))))");

for (String word : texts) {
    System.out.println(word);
}
Abraham Arnold
  • Not sure why you use `split`, since you need the "delimiter" in your result. From what is given, the following should do the job. `Scanner scanner = new Scanner(txt);scanner.findAll("\\w+|\\(\\w+ \\w+ \\w+\\)").map(matchResult -> matchResult.group()).forEach(System.out::println);` – samabcde Apr 17 '22 at 15:50
  • I tried it. But the text between delimiters may contain 2 words, like this: `A Z DelimiterB B X DelimiterA (C DelimiterA D)`. So I need to get `A Z` as one word, `B X` as one word, and then `(C DelimiterA D)`, and so on. – Abraham Arnold Apr 20 '22 at 15:47

1 Answer


IMO, Matching is easier than Splitting

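Since the "delimiter" is also needed in the result, I suggest matching the tokens we need instead of splitting. Based on the example given, we have the patterns below to capture. The alternatives are tried in the order listed, so the bracketed form takes priority over a bare delimiter or a plain word.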

  1. (C DelimiterA D) - A bracketed group containing a word, a delimiter, and a word,
    which is "\\(\\w+ (DelimiterA|DelimiterB) \\w+\\)".
  2. DelimiterB - A whole delimiter,
    which is "(DelimiterA|DelimiterB)".
  3. B, B X - One or more words that are not delimiters.
    How do we check that a word is not a delimiter?
    We can check that the " " in between is neither preceded nor followed by a delimiter, using negative lookbehind and lookahead (see "Regex not operator"), which is "\\w+((?<!(DelimiterA|DelimiterB))\\s(?!(DelimiterA|DelimiterB))\\w+)*".
import java.util.Scanner;

public class SplitWithCustomDelimiter {
    public static void main(String[] args) {
        String txt = "A Z DelimiterB B X DelimiterA (C DelimiterA D) DelimiterB (E DelimiterA F) DelimiterB G DelimiterA H";
        // Scanner can read from different sources (String, File, InputStream, ...); findAll requires Java 9+
        Scanner scanner = new Scanner(txt);
        scanner.findAll(
                "\\(\\w+ (DelimiterA|DelimiterB) \\w+\\)" +
                "|(DelimiterA|DelimiterB)" +
                "|\\w+((?<!(DelimiterA|DelimiterB))\\s(?!(DelimiterA|DelimiterB))\\w+)*"
                )
                .map(matchResult -> matchResult.group()).forEach(System.out::println);
    }
}
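If `Scanner.findAll` is not available (it was added in Java 9), or the tokens are needed as a list rather than printed, the same three alternatives can be driven by `Pattern` and `Matcher`. The following is a minimal sketch under that assumption, not part of the original answer; the class name `TokenizeWithMatcher` and the `tokenize` helper are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizeWithMatcher {
    // Same three alternatives as above, tried in this order:
    // bracketed group, whole delimiter, run of non-delimiter words.
    private static final Pattern TOKEN = Pattern.compile(
            "\\(\\w+ (DelimiterA|DelimiterB) \\w+\\)"
            + "|(DelimiterA|DelimiterB)"
            + "|\\w+((?<!(DelimiterA|DelimiterB))\\s(?!(DelimiterA|DelimiterB))\\w+)*");

    // Hypothetical helper: collects every match, in order, into a list.
    public static List<String> tokenize(String txt) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(txt);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String txt = "A Z DelimiterB B X DelimiterA (C DelimiterA D) DelimiterB (E DelimiterA F) DelimiterB G DelimiterA H";
        for (String token : tokenize(txt)) {
            System.out.println(token);
        }
    }
}
```

For the sample string this should print the eleven tokens listed in the question, one per line, with the bracketed groups kept intact.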
samabcde
  • Thank you very much for the answer; I appreciate the detailed explanation. One more thing: can I also match words that contain special characters? They are optional, so a word should still match whether or not special characters are present. – Abraham Arnold Apr 21 '22 at 15:55
  • @AbrahamArnold Replace each `\\w` with `[a-zA-Z0-9!+#]`; adding any special characters you need inside the `[]` should do the job. – samabcde Apr 22 '22 at 11:28
  • It did the job. Thank you very much again. – Abraham Arnold Apr 22 '22 at 16:35
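
The character-class substitution suggested in the comments can be sketched as follows. This is an illustrative sketch only: the class name `TokenizeWithSpecialChars`, the `WORD` and `DELIM` constants, and the sample input are assumptions, not taken from the thread. Note that `[a-zA-Z0-9!+#]` does not include the underscore that `\w` matches, so add `_` to the class if it is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizeWithSpecialChars {
    // Assumed word definition: letters, digits and the characters ! + #.
    private static final String WORD = "[a-zA-Z0-9!+#]+";
    private static final String DELIM = "(?:DelimiterA|DelimiterB)";

    // Same structure as the answer's pattern, with \w+ replaced by WORD.
    private static final Pattern TOKEN = Pattern.compile(
            "\\(" + WORD + " " + DELIM + " " + WORD + "\\)"
            + "|" + DELIM
            + "|" + WORD + "(?:(?<!" + DELIM + ")\\s(?!" + DELIM + ")" + WORD + ")*");

    // Hypothetical helper: collects every match, in order, into a list.
    public static List<String> tokenize(String txt) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(txt);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up input containing special characters inside words.
        String txt = "C++ Z# DelimiterB B! DelimiterA (C# DelimiterA D+) DelimiterB G";
        tokenize(txt).forEach(System.out::println);
    }
}
```

Building the pattern from `WORD` and `DELIM` keeps the character class in one place, so further special characters only need to be added once.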