
I want to extract every top-level message: the keyword message plus everything between the curly braces of the parent message, including any nested messages. With the pattern below, I am not getting the whole parent body.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ProtoMessages {
        public static void main(String[] args) {
            String data = "syntax = \"proto3\";\r\n" +
                "package grpc;\r\n" +
                "\r\n" +
                "import \"envoyproxy/protoc-gen-validate/validate/validate.proto\";\r\n" +
                "import \"google/api/annotations.proto\";\r\n" +
                "import \"google/protobuf/wrappers.proto\";\r\n" +
                "import \"protoc-gen-swagger/options/annotations.proto\";\r\n" +
                "\r\n" +
                "message Acc {\r\n" +
                "    message AccErr {\r\n" +
                "        enum Enum {\r\n" +
                "            UNKNOWN = 0;\r\n" +
                "            CASH = 1;\r\n" +
                "        }\r\n" +
                "    }\r\n" +
                "    string account_id = 1;\r\n" +
                "    string name = 3;\r\n" +
                "    string account_type = 4;\r\n" +
                "}\r\n" +
                "\r\n" +
                "message Name {\r\n" +
                "    string firstname = 1;\r\n" +
                "    string lastname = 2;\r\n" +
                "}";
            List<String> allMessages = new ArrayList<>();
            // Current attempt: match "message" and everything up to the first closing brace
            Pattern pattern = Pattern.compile("message[^\\}]*\\}");
            Matcher matcher = pattern.matcher(data);
            while (matcher.find()) {
                String str = matcher.group();
                allMessages.add(str);
                System.out.println(str);
            }
        }
    }
    

I expect my ArrayList of strings to have size 2, with the contents shown below.

allMessages.get(0) should be:

message Acc {
    message AccErr {
        enum Enum {
            UNKNOWN = 0;
            CASH = 1;
        }
    }
    string account_id = 1;
    string name = 3;
    string account_type = 4;
}

and allMessages.get(1) should be:

message Name {
    string firstname = 1;
    string lastname = 2;
}
Bohemian
Maana

2 Answers


First remove everything before the first line that starts with "message", then split on the line breaks that are followed by "message" (the line breaks are part of the split delimiter, so the blank lines between top-level messages are consumed):

String[] messages = data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)");

See live demo.

If you actually need a List<String>, pass that result to Arrays.asList():

List<String> messages = Arrays.asList(data.replaceAll("(?sm)\\A.*?(?=^message)", "").split("\\R+(?=message)"));

The first regex matches everything from the start of the input up to, but not including, the first line that starts with message, and replaces it with an empty string (i.e. deletes it). Breaking it down:

  • (?sm) turns on flags s, which makes dot also match newlines, and m, which makes ^ and $ match start and end of each line
  • \\A means the very start of input
  • .*? is .* (any quantity of any character, including newlines because the s flag is set) made reluctant by the trailing ?, so it matches as few characters as possible while still matching
  • (?=^message) is a lookahead that requires the following characters to be the start of a line and then "message"

See regex101 live demo for a thorough explanation.
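
As a minimal sketch of this first step on its own (the class name StripPreamble, the helper stripPreamble, and the shortened sample input are made up for illustration), the following shows why the ^ inside the lookahead matters: a "message" that appears mid-line, here inside a comment, does not stop the match.

```java
public class StripPreamble {
    // Hypothetical helper: deletes everything before the first line that starts with "message".
    static String stripPreamble(String proto) {
        // (?s): dot matches line breaks; (?m): ^ matches at the start of each line
        return proto.replaceAll("(?sm)\\A.*?(?=^message)", "");
    }

    public static void main(String[] args) {
        // Shortened sample input; the comment line contains "message" mid-line on purpose
        String proto = "syntax = \"proto3\";\r\n"
                + "// message types follow\r\n"
                + "\r\n"
                + "message Name {\r\n"
                + "    string firstname = 1;\r\n"
                + "}";
        System.out.println(stripPreamble(proto));
    }
}
```

If no line starts with "message", the regex does not match and the input comes back unchanged.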

The split regex matches one or more line break sequences when they are followed by "message":

  • \\R+ means one or more line break sequences (all OS variants)
  • (?=message) is a look ahead and means the following characters are "message"

See regex101 live demo for a thorough explanation.
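
Putting both steps together, a small runnable sketch might look like this. The class name SplitTopLevelMessages, the helper topLevelMessages, and the shortened input are assumptions for illustration, not part of the original code.

```java
import java.util.Arrays;
import java.util.List;

public class SplitTopLevelMessages {
    // Hypothetical helper: returns one string per top-level message.
    static List<String> topLevelMessages(String proto) {
        // Step 1: drop everything before the first line that starts with "message"
        String withoutPreamble = proto.replaceAll("(?sm)\\A.*?(?=^message)", "");
        // Step 2: split on line breaks that are immediately followed by "message";
        // nested messages are indented, so the lookahead does not match before them
        return Arrays.asList(withoutPreamble.split("\\R+(?=message)"));
    }

    public static void main(String[] args) {
        // Shortened version of the input from the question
        String proto = "syntax = \"proto3\";\r\n"
                + "\r\n"
                + "message Acc {\r\n"
                + "    message AccErr {\r\n"
                + "    }\r\n"
                + "    string account_id = 1;\r\n"
                + "}\r\n"
                + "\r\n"
                + "message Name {\r\n"
                + "    string firstname = 1;\r\n"
                + "}";
        List<String> messages = topLevelMessages(proto);
        System.out.println("count: " + messages.size());
        for (String message : messages) {
            System.out.println("---");
            System.out.println(message);
        }
    }
}
```

This relies on nested messages being indented and top-level messages starting in column 0. Anything else at the top level after the first message (a service block, for example) stays attached to the preceding message, because the split only happens before "message".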

Bohemian
  • Thanks. With the above approach I am getting all the data in a single string in the list. I want to split it by message, so with the above example I should get an ArrayList of size two. Editing the question. – Maana Feb 04 '22 at 21:08
  • @Maana This code produces *two* strings. Please copy paste my code into your IDE and try. You can see it produces 2 strings in the linked live demo. (I made some edits since the original posting, so you may not be using the latest version of my answer) – Bohemian Feb 04 '22 at 21:13
  • Bohemian: Can you please explain what that pattern means, or share a reference link, if you don't mind? – Maana Feb 04 '22 at 21:26
  • @Maana Explanations added – Bohemian Feb 04 '22 at 23:15
  • If my string contains extra things that I don't want in my list, like the service below, how can I exclude them? Any suggestions? service Service { rpc getBalance() returns (response) { } } – Maana Feb 04 '22 at 23:48
  • @Maana ask another question! (And tell me its link - there is another approach entirely that will work for more complicated situations, but it deserves its own question) – Bohemian Feb 05 '22 at 01:08
  • Created a new question, please advise, thanks: https://stackoverflow.com/questions/71021550/regex-only-returns-message-string-thats-starts-with-messages-and-string-betw – Maana Feb 07 '22 at 15:59

Try this for your regex. It anchors on message being the start of a line, and uses a positive lookahead to find the next message or the end of messages.

Pattern.compile("(?s)\r\n(message.*?)(?=(\r\n)+message|$)")
// or
Pattern.compile("(?s)\r?\n(message.*?)(?=(\r?\n)+message|$)")

No splitting, parsing, or managing nested braces either :)

https://regex101.com/r/Wa2xxx/1
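
A minimal sketch of how this pattern could be used with a Matcher follows. The class name MatchTopLevelMessages, the helper topLevelMessages, and the shortened input are made up for illustration. The message text is in capture group 1, because the overall match also includes the line break in front of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchTopLevelMessages {
    // Second variant from above: accepts both \r\n and \n line endings
    private static final Pattern TOP_LEVEL_MESSAGE =
            Pattern.compile("(?s)\r?\n(message.*?)(?=(\r?\n)+message|$)");

    // Hypothetical helper: collects group 1 of every match.
    static List<String> topLevelMessages(String proto) {
        List<String> messages = new ArrayList<>();
        Matcher matcher = TOP_LEVEL_MESSAGE.matcher(proto);
        while (matcher.find()) {
            messages.add(matcher.group(1)); // group 1 excludes the leading line break
        }
        return messages;
    }

    public static void main(String[] args) {
        // Shortened version of the input from the question
        String proto = "syntax = \"proto3\";\r\n"
                + "\r\n"
                + "message Acc {\r\n"
                + "    message AccErr {\r\n"
                + "    }\r\n"
                + "    string account_id = 1;\r\n"
                + "}\r\n"
                + "\r\n"
                + "message Name {\r\n"
                + "    string firstname = 1;\r\n"
                + "}";
        for (String message : topLevelMessages(proto)) {
            System.out.println(message);
            System.out.println("---");
        }
    }
}
```

Note that the pattern requires a line break in front of each message, so a message on the very first line of the input would not be matched.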

Steven Spungin