1

I would like a regular expression to match a C structure definition. This is my target data:

typedef struct
{
}dontMatchThis;

typedef struct
{
  union //lets have a union as well
  {
    struct 
    {
     int a
     //a comment for fun

     int b;
     int c;
    };
    char byte[10];
  };
}structA;

I want to match the definition of structA only, from typedef to structA.

I have tried : typedef[\s\S]+?structA

But even though I'm using the non-greedy modifier, this matches both structures. Any suggestions?

Nicolas Kaiser
user2370532
    I'm fairly certain that C/C++ syntax is not a regular language, hence regular expressions are probably not the appropriate tool for parsing it... – twalberg May 10 '13 at 16:25
  • If OP is looking for a specific pattern (e.g., *this* specific example), a regex should be able to find it. After all, a regex made of exactly these characters is just testing for string identity, and regexes do that just fine. The question is how much you can generalize ("patternize"), and what patterns OP actually needs. If OP wants to match structs that look *like* this but contain other nested substructures, then regexes cannot do the job. – Ira Baxter May 10 '13 at 16:49

4 Answers

1

In the general case, it is simply not possible. The typedef or the struct could have been generated by preprocessor macro invocations: you could have the typedef in one file and the struct in another #include-d file, or the struct coming from one preprocessor macro and the typedef from another.

I would suggest instead extending or customizing the GCC compiler, either through a plugin or a MELT extension (MELT is a domain-specific language for extending GCC).

See also etags

Basile Starynkevitch
1

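The problem is the point where the regexp begins matching. A match starts at the leftmost position where it can succeed, and the non-greedy modifier only affects how far the match extends from there. So the regexp correctly starts matching at the first typedef and continues until structA.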
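To make this concrete, here is a minimal sketch in Python (my choice of language; the question does not name a regex engine) that runs the pattern from the question against the target data. The variable names are only illustrative.

```python
import re

# The target data from the question, copied as-is.
SOURCE = """\
typedef struct
{
}dontMatchThis;

typedef struct
{
  union //lets have a union as well
  {
    struct
    {
     int a
     //a comment for fun

     int b;
     int c;
    };
    char byte[10];
  };
}structA;
"""

# The pattern tried in the question.
match = re.search(r"typedef[\s\S]+?structA", SOURCE)

# The match begins at the very first typedef (offset 0) ...
starts_at_first_typedef = match.start() == 0
# ... and therefore swallows the unwanted struct as well.
spans_both_structs = "dontMatchThis" in match.group(0)

print(starts_at_first_typedef, spans_both_structs)  # True True
```

The lazy quantifier does make the match stop at the first structA it can reach, but it has no say over where the match starts.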

What you're trying to do is really difficult with a regex (I would say impossible to do correctly). You would need to match nested braces to see where the struct ends.

See Building a Regex Based Parser.
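As a rough illustration of the brace matching mentioned above, the following Python sketch uses a regex only to locate each `typedef struct` and then counts braces by hand to find where the definition ends. The function name `extract_typedef_struct` is hypothetical, and the sketch assumes there are no braces inside comments, string literals, or preprocessor macros, so it is not a real C parser.

```python
import re

# Matches the name and semicolon that follow the closing brace.
TAIL = re.compile(r"\s*(\w+)\s*;")


def extract_typedef_struct(source, name):
    """Return the text of `typedef struct { ... } name;`, or None if absent.

    Hypothetical helper: assumes braces never appear in comments or strings.
    """
    for start in re.finditer(r"typedef\s+struct\b", source):
        open_brace = source.find("{", start.end())
        if open_brace == -1:
            return None
        depth = 0
        for pos in range(open_brace, len(source)):
            if source[pos] == "{":
                depth += 1
            elif source[pos] == "}":
                depth -= 1
                if depth == 0:
                    # Outermost brace closed: check the typedef name.
                    tail = TAIL.match(source, pos + 1)
                    if tail and tail.group(1) == name:
                        return source[start.start():tail.end()]
                    break  # Not the wanted struct; try the next typedef.
    return None


SOURCE = """\
typedef struct
{
}dontMatchThis;

typedef struct
{
  union //lets have a union as well
  {
    struct
    {
     int a
     //a comment for fun

     int b;
     int c;
    };
    char byte[10];
  };
}structA;
"""

print(extract_typedef_struct(SOURCE, "structA"))
```

Counting depth is what a plain regular expression cannot do, which is why the loop is written out by hand here.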

Community
ctn
0

I found the following works for me:

([\s\S]*)(typedef([\s\S]*)?structA)

I then select the second group, which contains my structure. The leading [\s\S]* is greedy, so it consumes all the definitions before the target struct and leaves the last typedef from which structA can still be reached for the second group.
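Here is a small Python sketch of this approach. It assumes the character classes in the pattern carry a `*` quantifier, that is `([\s\S]*)(typedef([\s\S]*)?structA)`; the variable names are only illustrative.

```python
import re

# Assumed form of the pattern: both [\s\S] classes are followed by `*`.
PATTERN = re.compile(r"([\s\S]*)(typedef([\s\S]*)?structA)")

SOURCE = """\
typedef struct
{
}dontMatchThis;

typedef struct
{
  union //lets have a union as well
  {
    struct
    {
     int a
     //a comment for fun

     int b;
     int c;
    };
    char byte[10];
  };
}structA;
"""

match = PATTERN.search(SOURCE)

# Group 1 greedily takes everything up to the last usable typedef;
# group 2 is the definition we are after.
wanted = match.group(2)
print(wanted)
```

One caveat with this sketch: the inner `[\s\S]*` is greedy too, so if the name structA appears again later in the file (for example in a variable declaration), group 2 would extend to that last occurrence.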

user2370532
0

As ctn stated, the problem with the non-greedy modifier in your regex is that the search starts at the first typedef and stops at the first place where it finds structA; everything in between is accepted as part of the match.

One way to solve your problem with a regex is to define a pattern that identifies every struct and then, in a separate stage, check whether each match corresponds to the struct that you want.

For example, using the regex:

(typedef[\s\S]+?})\s*([a-zA-Z0-9_]+)\s*;

This defines 2 groups. The first starts at a typedef and, using a non-greedy match, ends at a closing curly brace; it contains the text that you might want. That closing brace must be followed by the struct name, captured by ([a-zA-Z0-9_]+), and then a ;. For your example, there will be 2 matches, each containing 2 groups.

Match 1:

(typedef struct
{
})(dontMatchThis);

Value of group 2: dontMatchThis

Match 2:

(typedef struct
{
  union //lets have a union as well
  {
    struct 
    {
     int a
     //a comment for fun

     int b;
     int c;
    };
    char byte[10];
  };
})(structA);

Value of group 2: structA

Thus, it becomes a matter of checking whether the value of group 2 is structA.
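The two stages could be sketched in Python as follows. The regex is the one given above; the helper name `find_struct` is hypothetical and only there for illustration.

```python
import re

# Stage 1: the regex from this answer, which finds every typedef'd struct.
STRUCT_PATTERN = re.compile(r"(typedef[\s\S]+?})\s*([a-zA-Z0-9_]+)\s*;")


def find_struct(source, name):
    """Stage 2: return the full match whose group 2 equals `name`, or None."""
    for match in STRUCT_PATTERN.finditer(source):
        if match.group(2) == name:
            return match.group(0)
    return None


SOURCE = """\
typedef struct
{
}dontMatchThis;

typedef struct
{
  union //lets have a union as well
  {
    struct
    {
     int a
     //a comment for fun

     int b;
     int c;
    };
    char byte[10];
  };
}structA;
"""

names = [m.group(2) for m in STRUCT_PATTERN.finditer(SOURCE)]
print(names)  # ['dontMatchThis', 'structA']
print(find_struct(SOURCE, "structA"))
```

Note that this relies on the inner closing braces being followed directly by a semicolon. A named nested member such as `} inner;` would end the match early, so treat it as a sketch for data shaped like the example.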