
I have a string like:

Fields  { name:"aa" type: "bb" paramA { name:"cc" } paramB { other:"ee" other_p:"ff"} paramC { name: "bb" param: "dd" other_params { abc: "xx" xyz:"yy"}} }

My Java regex code extracts everything between the braces for paramA, paramB and other_params. I need to structure this into a Java object, but I am stuck on extracting paramC, which contains a nested block.

Pattern pattern = Pattern.compile("\\w+\\s(\\{([^{]*?)\\})");
Matcher matcher = pattern.matcher(theAboveString);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}

My code for the extraction

Amrida D

2 Answers


You can't parse arbitrarily nested nodes with a regular expression. (See the Chomsky hierarchy of languages and automata, or any Stack Overflow question about parsing HTML with regex.)

I've made a library that lets you parse things like this. It even has proper documentation.

http://sourceforge.net/projects/jparser2/

Documentation:

http://sourceforge.net/projects/jparser2/files/doc/
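
For a sense of what a non-regex parser for this format involves, here is a minimal hand-written recursive-descent sketch. It does not use jparser2 and says nothing about that library's API. The class name `DescentParser`, the `flatten` method and the `path.key=value` output format are all made up for illustration, and the sketch assumes values are double-quoted strings without escape sequences.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical recursive-descent reader for the question's name { key:"value" ... } format. */
final class DescentParser {
    private final String src;
    private int pos = 0;
    private final List<String> out = new ArrayList<>();

    private DescentParser(String src) { this.src = src; }

    /** Returns one "path.key=value" entry per field, in input order. */
    static List<String> flatten(String input) {
        DescentParser p = new DescentParser(input);
        p.skipSpaces();
        String name = p.word();
        p.skipSpaces();
        p.expect('{');
        p.block(name);
        p.skipSpaces();
        if (p.pos != input.length())
            throw new IllegalArgumentException("Trailing data at " + p.pos);
        return p.out;
    }

    // block := ( word ':' string | word '{' block )* '}'
    // The recursive call is what lets this handle any nesting depth.
    private void block(String path) {
        while (true) {
            skipSpaces();
            if (peek() == '}') { pos++; return; }
            String name = word();
            skipSpaces();
            if (peek() == ':') {
                pos++;
                skipSpaces();
                out.add(path + "." + name + "=" + string());
            } else {
                expect('{');
                block(path + "." + name);
            }
        }
    }

    private char peek() {
        if (pos >= src.length())
            throw new IllegalArgumentException("Unexpected end of text");
        return src.charAt(pos);
    }

    private void expect(char c) {
        if (peek() != c)
            throw new IllegalArgumentException("Expected '" + c + "' at " + pos);
        pos++;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    private String word() {
        int start = pos;
        while (pos < src.length()
                && (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '_')) pos++;
        if (pos == start)
            throw new IllegalArgumentException("Expected a name at " + pos);
        return src.substring(start, pos);
    }

    // Reads a double-quoted string; escape sequences are NOT handled (assumption).
    private String string() {
        expect('"');
        int end = src.indexOf('"', pos);
        if (end < 0)
            throw new IllegalArgumentException("Unterminated string at " + pos);
        String s = src.substring(pos, end);
        pos = end + 1;
        return s;
    }
}
```

Calling `DescentParser.flatten(...)` on the string from the question should yield nine entries, starting with `Fields.name=aa`; a missing closing brace raises an `IllegalArgumentException`.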

zslevi

Here's an example of parsing using regex:

String input = "Fields  { name:\"aa\" type: \"bb\" paramA { name:\"cc\" } paramB { other:\"ee\" other_p:\"ff\"} paramC { name: \"bb\" param: \"dd\" other_params { abc: \"xx\" xyz:\"yy\"}} }";
Matcher m = Pattern.compile("\\s*(?:(\\w+)\\s*(?::\\s*(\".*?\")|\\{)|\\})\\s*").matcher(input);
int start = 0;
Deque<String> stack = new ArrayDeque<>();
while (m.find()) {
    if (m.start() != start)
        throw new IllegalArgumentException("Invalid data at " + start);
    if (m.group(2) != null) {
        System.out.println(stack + " : " + m.group(1) + " = " + m.group(2));
    } else if (m.group(1) != null) {
        //System.out.println(m.group(1) + " {");
        stack.addLast(m.group(1));
    } else {
        //System.out.println("}");
        if (stack.isEmpty())
            throw new IllegalArgumentException("Unbalanced brace at " + start);
        stack.removeLast();
    }
    start = m.end();
}
if (start != input.length())
    throw new IllegalArgumentException("Invalid data at " + start);
if (! stack.isEmpty())
    throw new IllegalArgumentException("Unexpected end of text");

Output

[Fields] : name = "aa"
[Fields] : type = "bb"
[Fields, paramA] : name = "cc"
[Fields, paramB] : other = "ee"
[Fields, paramB] : other_p = "ff"
[Fields, paramC] : name = "bb"
[Fields, paramC] : param = "dd"
[Fields, paramC, other_params] : abc = "xx"
[Fields, paramC, other_params] : xyz = "yy"

You should be able to take it from here.
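
As one possible way to take it from there, the sketch below reuses the same regex and stack, but keeps `Block` objects on the stack instead of names so that the result is a tree. `Block` and `BlockParser` are hypothetical names, not part of any library. It assumes every value is a quoted string (it strips the surrounding quotes) and that the input has a single top-level block. Fields and children are kept in lists because a name such as `param` may repeat.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical tree node: a named block with scalar fields and nested blocks. */
final class Block {
    final String name;
    final List<String[]> fields = new ArrayList<>();  // {key, value} pairs, in input order
    final List<Block> children = new ArrayList<>();   // nested blocks; names may repeat

    Block(String name) { this.name = name; }

    /** Returns the first value stored under key, or null if absent. */
    String get(String key) {
        for (String[] f : fields)
            if (f[0].equals(key)) return f[1];
        return null;
    }

    /** Returns the first nested block with the given name, or null if absent. */
    Block child(String childName) {
        for (Block b : children)
            if (b.name.equals(childName)) return b;
        return null;
    }
}

final class BlockParser {
    // Same token regex as above: `name {`, `key: "value"`, or `}`.
    private static final Pattern TOKEN = Pattern.compile(
            "\\s*(?:(\\w+)\\s*(?::\\s*(\".*?\")|\\{)|\\})\\s*");

    static Block parse(String input) {
        Matcher m = TOKEN.matcher(input);
        Deque<Block> stack = new ArrayDeque<>();
        Block root = null;
        int start = 0;
        while (m.find()) {
            if (m.start() != start)
                throw new IllegalArgumentException("Invalid data at " + start);
            if (m.group(2) != null) {                 // key: "value"
                if (stack.isEmpty())
                    throw new IllegalArgumentException("Field outside a block at " + start);
                String quoted = m.group(2);
                String value = quoted.substring(1, quoted.length() - 1); // drop the quotes
                stack.peekLast().fields.add(new String[] { m.group(1), value });
            } else if (m.group(1) != null) {          // name {
                Block b = new Block(m.group(1));
                if (stack.isEmpty()) {
                    if (root != null)
                        throw new IllegalArgumentException("Second top-level block at " + start);
                    root = b;
                } else {
                    stack.peekLast().children.add(b); // attach to the enclosing block
                }
                stack.addLast(b);
            } else {                                  // }
                if (stack.isEmpty())
                    throw new IllegalArgumentException("Unbalanced brace at " + start);
                stack.removeLast();
            }
            start = m.end();
        }
        if (start != input.length())
            throw new IllegalArgumentException("Invalid data at " + start);
        if (!stack.isEmpty())
            throw new IllegalArgumentException("Unexpected end of text");
        return root;
    }
}
```

With the `input` string above, `BlockParser.parse(input).child("paramC").child("other_params").get("xyz")` should give `yy`. The stack is still empty once parsing succeeds; the tree survives because `root` holds a reference to it.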

UPDATE

To also support numeric values, use this regex:

"\\s*(?:(\\w+)\\s*(?::\\s*(\".*?\"|[-+0-9.eE]+)|\\{)|\\})\\s*"

Testing with "Layer { name: \"conv2\" type: \"Convolution\" bottom: \"norm1\" top: \"conv2\" param { lr_mult: 1 decay_mult: 1 } param { lr_mult: 2 decay_mult: 0 } convolution_param { num_output: 256 pad: 2 kernel_size: 5 group: 2 weight_filler { type: \"gaussian\" std: 0.01 } bias_filler { type: \"constant\" value: 1 } }}" produces:

[Layer] : name = "conv2"
[Layer] : type = "Convolution"
[Layer] : bottom = "norm1"
[Layer] : top = "conv2"
[Layer, param] : lr_mult = 1
[Layer, param] : decay_mult = 1
[Layer, param] : lr_mult = 2
[Layer, param] : decay_mult = 0
[Layer, convolution_param] : num_output = 256
[Layer, convolution_param] : pad = 2
[Layer, convolution_param] : kernel_size = 5
[Layer, convolution_param] : group = 2
[Layer, convolution_param, weight_filler] : type = "gaussian"
[Layer, convolution_param, weight_filler] : std = 0.01
[Layer, convolution_param, bias_filler] : type = "constant"
[Layer, convolution_param, bias_filler] : value = 1
Andreas
  • That's a great solution. I have only one more problem. The input may be something like: Fields { name:"aa" type: "bb" paramA { name:"cc" } paramB { other:"ee" other_p:"ff"} }. In this case, should I have a different regex for each different string or type of field? – Amrida D Dec 01 '15 at 16:41
  • Sorry for bothering you. I tried the code, but it does not work for the following string. I tried to change the regex, but without success: "Layer { name: \"conv2\" type: \"Convolution\" bottom: \"norm1\" top: \"conv2\" param { lr_mult: 1 decay_mult: 1 } param { lr_mult: 2 decay_mult: 0 } convolution_param { num_output: 256 pad: 2 kernel_size: 5 group: 2 weight_filler { type: \"gaussian\" std: 0.01 } bias_filler { type: \"constant\" value: 1 } }}". Can you please have a look? – Amrida D Dec 02 '15 at 10:13
  • I used this regex: Matcher m=Pattern.compile("\\s*(?:(\\w+)\\s*(?::\\s*((\".*?\")|(\\s*\\d+\\s*)|(\\s*\\w+\\s*))|\\{)|\\})\\s*").matcher(layer); – Amrida D Dec 02 '15 at 12:01
  • It works also for numbers but I get as output the following: [Layer] : name = "conv1" [Layer] : type = "Convolution" [Layer] : bottom = "data" [Layer] : top = "conv1" [Layer, param] : lr_mult = 1 [Layer, param] : decay_mult = 1 [Layer, param] : lr_mult = 2 [Layer, param] : decay_mult = 0 [Layer, convolution_param] : num_output = 96 [Layer, convolution_param] : kernel_size = 11 [Layer, convolution_param] : stride = 4 [Layer, convolution_param, weight_filler] : type = "gaussian" [Layer, convolution_param, weight_filler] : std = 0. Don't know where the error is – Amrida D Dec 02 '15 at 12:02
  • Use this: `"\\s*(?:(\\w+)\\s*(?::\\s*(\".*?\"|[-+0-9.eE]+)|\\{)|\\})\\s*"` – Andreas Dec 02 '15 at 15:09
  • It does not work better than mine: It extracts only ` [Layer] : name = "data" [Layer] : type = "Data" [Layer] : top = "data" [Layer] : top = "label"` – Amrida D Dec 02 '15 at 15:19
  • With mine: `"\\s*(?:(\\w+)\\s*(?::\\s*((\".*?\")|(\\s*\\d+\\s*)|(\\s*\\w+\\s*))|\\{)|\\})\\s*"` I get: [Layer] : name = "conv1" [Layer] : type = "Convolution" [Layer] : bottom = "data" [Layer] : top = "conv1" [Layer, param] : lr_mult = 1 [Layer, param] : decay_mult = 1 [Layer, param] : lr_mult = 2 [Layer, param] : decay_mult = 0 [Layer, convolution_param] : num_output = 96 [Layer, convolution_param] : kernel_size = 11 [Layer, convolution_param] : stride = 4 [Layer, convolution_param, weight_filler] : type = "gaussian" [Layer, convolution_param, weight_filler] : std = 0 – Amrida D Dec 02 '15 at 15:20
  • I just ran it with that `"Layer { ..."` string, and it worked fine with the regex I just gave. – Andreas Dec 02 '15 at 15:26
  • One more question: why is the stack empty when I print it outside the while loop? – Amrida D Dec 02 '15 at 17:03
  • Because it has to be, see last `if` statement. The values are removed from the stack when an end-brace (`}`) is processed. If the stack isn't empty when done, it means that end-braces are missing. – Andreas Dec 02 '15 at 17:40
  • In the end I succeeded in parsing everything and modifying the attribute values. Now the question is: how can I write the deque to a file in the same format? – Amrida D Dec 16 '15 at 14:49
  • @AmridaD That sounds like something you should ask in a new question. – Andreas Dec 16 '15 at 15:56
  • Ok Andreas: http://stackoverflow.com/questions/34318327/java-parsing-string-read-write-to-file – Amrida D Dec 16 '15 at 17:26