Help extracting a block of text between matching curly braces in a c-like language

Question

I have some documentation that I made for an HDF5 file format, which is written in the GraphViz dot language. (This is a C-like language with lots of curly braces.) This master file contains numerous elements like this:

subgraph cluster_clustername { 
                              ...
                              lots of stuff including more curly braces spanning multiple lines
                              ...
                              }

I want to extract such a block of text based on clustername. (I would like to create graphs of these subgraphs individually instead of one very large graph containing everything. Each subgraph cluster is an individual HDF5 file, and the files are connected through HDF5 external softlinks.)

There should be a way to extract this desired hunk of text (an exercise in matching the first { after some specific pattern of text and the closing } across multiple lines, with nesting). This seems like it should be a relatively common task because of the prevalence of C and C-like languages.

In my mind the top candidate tools for accomplishing this are:

awk

python

gvpr - the graph stream editor provided with Graphviz (but this won't be helpful to others, say C programmers with the same question; few examples exist on the web, and the syntax is confusing)

sed

Currently I maintain the master file, then update each of the derived files in Emacs using M-x ediff-regions-linewise, but I need an automated (so I can use Make to build documentation files) and robust method of generating the derived files. Of the tools above, the only one I have even modest experience with is sed, but because the pattern is complicated and spans multiple lines, I think a tool like awk or python might be better suited to the task.

In fact I tried a technique similar to reference counting in awk, but I am running into problems understanding some of the more subtle behaviors of awk, and I have only really used awk one-liners in the past.

Thanks so much in advance for any help you have. -Z

A similar question for a regex solution has been asked [here](http://stackoverflow.com/questions/1430355/regular-expression-for-content-within-braces), so regex is depending on your regex engine and not trivial, would not be my first choice. — stema, Mar 30 '11 at 21:20
Does your "Lots of stuff" include string literals or comments containing (non-significant) curly braces which should be ignored? — ridgerunner, Mar 30 '11 at 21:30

score 1 · Answer 1 · answered Mar 30 '11 at 21:07

Using Perl, you'd use the Text::Balanced module. It can return you text before, inside, and after balanced delimiters.

CanSpice


Thanks for the update. I'll google around a bit, but be warned I have exactly ZERO perl experience. If anybody has a more OTS solution I'd certainly appreciate it. In the mean time I'll explore this. – zbeekman Mar 30 '11 at 22:45

score 1 · Answer 2 · answered Mar 31 '11 at 11:50

I can't tell you this is the best or most elegant solution, but I've used this Python function before and it works. It won't handle unbalanced brackets in comments or string literals, but it does handle nested brackets. It returns a list [start_index, token, end_index + 1], so use it like: start, token, end = get_token_between_chars(string_to_parse, '{', '}')

def get_token_between_chars(string, start_char, end_char):
  token = ''

  n_left = 0
  n_right = 0
  closed = False

  start_index = 0
  end_index = 0
  count = 0

  for c in string:
    if c == start_char:
      n_left += 1
      if n_left == 1:
        start_index = count
    elif c == end_char:
      n_right += 1

    if n_left > n_right and not closed:
      token += c
    elif n_left > 0 and n_left == n_right:
      closed = True
      end_index = count
      break

    count += 1

  token = token[1 : len(token)]
  return [start_index, token, end_index+1]
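
To tie this back to the question, here is a minimal sketch of how the same brace-counting idea could pull out one subgraph by cluster name. The function name extract_cluster and the sample text are made up for illustration, and it assumes the header always looks like "subgraph cluster_<name> {" and that no braces appear inside string literals or comments.

```python
import re

def extract_cluster(text, name):
    """Return the whole 'subgraph cluster_<name> { ... }' block, or None.

    Hypothetical helper: assumes braces never occur inside strings or comments.
    """
    # Find the header and the opening brace that follows it.
    header = re.search(r"subgraph\s+cluster_" + re.escape(name) + r"\s*\{", text)
    if header is None:
        return None
    depth = 0
    # Scan from the opening brace, tracking nesting depth.
    for i in range(header.end() - 1, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                # Matching close found: return header through closing brace.
                return text[header.start():i + 1]
    return None  # braces never balanced

sample = """subgraph cluster_A {
  a -> b;
  subgraph inner { c; }
}
a -> z;
subgraph cluster_B {
  x;
}
"""

print(extract_cluster(sample, "A"))
```

Because it returns only the text from the header to the matching brace, the edges that sit between subgraphs (such as "a -> z;" in the sample) are left out.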

Thanks Dan. I'll take a look at this to make sure I understand everything then give it a shot. Together with the python regex module I think I should be able to make this work. — zbeekman, Mar 31 '11 at 13:38

score 0 · Answer 3 · answered Mar 31 '11 at 00:06

You can use awk or any programming language with good string processing capabilities. For example, split the text on some prominent pattern. Say that "subgraph" separates each block and you want to get cluster_A; you can do this:

$ cat file
subgraph cluster_A {
                              ...
                              lots of stuff more curly {
                          }
                              ...
                              }

subgraph cluster_B {
                              ...
                              lots of stuff including more curly braces spanning multiple lines
                              ...
                              }

$ awk 'BEGIN{RS="subgraph"} /cluster_A/{ print "subgraph "$0} ' file
subgraph  cluster_A {
                              ...
                              lots of stuff more curly {
                          }
                              ...
                              }

The problem is that between each subgraph there is other stuff I don't want (edges connecting different components of each subgraph together). If it were simply a matter of getting the text between line pattern1 and line pattern2, one could easily do: `sed -n '/pattern1/,/pattern2/p' filename.dot`. The only way to find the end of the block is to find the matching } that closes the block. — zbeekman, Mar 31 '11 at 13:33
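
Finding the matching } described in the comment above comes down to tracking nesting depth. Here is a rough sketch that also skips braces inside double-quoted strings and C-style comments, which was the concern raised in the comments on the question. The name find_matching_brace is hypothetical, and it assumes strings use backslash escapes and comments are // or /* */ only (DOT's # preprocessor lines are not handled).

```python
def find_matching_brace(text, open_pos):
    """Return the index of the '}' matching the '{' at open_pos, or -1.

    Hypothetical helper: handles "..." strings, // and /* */ comments only.
    """
    depth = 0
    i = open_pos
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == '"':
            # Skip a double-quoted string, honouring backslash escapes.
            i += 1
            while i < n and text[i] != '"':
                i += 2 if text[i] == "\\" else 1
        elif text.startswith("//", i):
            # Skip a line comment up to the end of the line.
            end = text.find("\n", i)
            i = n if end == -1 else end
        elif text.startswith("/*", i):
            # Skip a block comment; land on its final '/'.
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 1
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    return -1  # no matching brace found

sample = 'subgraph cluster_A {\n  label="a } b"; // stray }\n  /* { */ n1 -> n2;\n}\nrest'
open_pos = sample.index("{")
close_pos = find_matching_brace(sample, open_pos)
print(sample[:close_pos + 1])
```

The braces in the label string and in both comments are ignored, so the block ends at the last } before "rest".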
