Awk doesn't match all my entries

Question

I'm trying to write "a script" (essentially an awk command) that extracts the function prototypes from a .c file in order to generate the .h header automatically. I'm new to awk, so I don't understand all the details.

Here is a sample of the .c source:

dict_t dictup(dict_t d, const char * key, const char * newval)
{

  int i = dictlook(d, key);

  if (i == DICT_NOT_FOUND) {

    fprintf(stderr, "key \"%s\" doesn't exist.\n", key);
    dictdump(d);
  }
  else {

    strncpy(d.entry[i].val, newval, DICTENT_VALLENGTH);
  }

  return d;
}


dict_t* dictrm(dict_t* d, const char * key) {

  int i = dictlook(d, key);

  if (i == DICT_NOT_FOUND) {

    fprintf(stderr, "key \"%s\" doesn't exist.\n", key);
    dictdump(d);
  }
  else {
    d->entry[i] = d->entry[--d->size];
  }
  if ( ((float)d->size)/d->maxsize < 0.25 ) {
    d->maxsize /= 2; 
    d->entry = realloc(d->entry, d->maxsize*sizeof(dictent_t*));
  }

  return d;
}

And this is what I want to generate:

dict_t dictup(dict_t d, const char * key, const char *newval); 
dict_t* dictrm(dict_t* d, const char * key);

My command with the full regex looks like this :

 awk '/^[a-zA-Z*_]+[:space:]+[a-zA-Z*_]+[:space:]*\(.*?\)/{ print $0 }' dict3.c

But I get nothing from it. So I tried cutting it down to see whether I could get anything at all. I tried this:

awk '/^[a-zA-Z*_]+[:space:]+[a-zA-Z*_]+/{ print $0 }' dict3.c

And I get that :

dictent_t* dictentcreate(const char * key, const char * val) 
dict_t* dictcreate() 
dict_t* dictadd(dict_t* d, const char * key, const char * val) 
dict_t dictup(dict_t d, const char * key, const char * newval) 
dict_t* dictrm(dict_t* d, const char * key) {

And it raises a lot of questions!

Why doesn't the first regex work?
And why did the second one catch some of the declarations, but not all? I assure you that there is no space before any declaration. I guess it didn't catch other parts of the code, like variable declarations, because of the indentation.
Third question: why did it catch the whole line when I only need the matched expression?
Last one: how can I add the ; at the end of each match?

@EdMorton I thought so but quick testing (as I was distracted) indicated that helped but that's probably just because I wasn't paying attention and other things were wrong. — Etan Reisner, Oct 15 '15 at 13:57

John1024 · Answer 1 · 2015-10-16T19:22:28.657

5

Note: the question has changed substantially since I wrote this answer.

Replace [:space:] with [[:space:]]:

$ awk '/^[a-zA-Z*_]+[[:space:]]+[a-zA-Z*_]+[[:space:]]*[(].*?[)]/{ print $0 }' dict3.c
dictent_t* dictentcreate(const char * key, const char * val)  
dict_t* dictcreate() 
void dictdestroy(*dict_t d) 
void dictdump(dict_t *d) 
int dictlook(dict_t *d, const char * key) 
int dictget(char* s, dict_t *d, const char *key)
dict_t* dictadd(dict_t* d, const char * key, const char * val)
dict_t dictup(dict_t d, const char * key, const char *newval) 
dict_t* dictrm(dict_t* d, const char * key)

The reason is that [:space:] will match any of the characters :, s, p, a, c, or e. This is not what you want.

You want [[:space:]] which will match any whitespace.
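A quick way to see the difference for yourself (a minimal sketch; the two input lines are made up for illustration and are not taken from dict3.c):

```shell
# Two made-up input lines: a definition-like line and the bare word "space".
sample=$(printf 'dict_t dictcreate()\nspace\n')

# [:space:] is an ordinary bracket expression listing the characters : s p a c e,
# so this pattern matches the line made only of those letters. Some awks
# (gawk, for one) print a warning about it on stderr, which is discarded here.
wrong=$(printf '%s\n' "$sample" | awk '/^[:space:]+$/' 2>/dev/null)

# [[:space:]] is the whitespace class, so this matches "type<whitespace>name".
right=$(printf '%s\n' "$sample" | awk '/^[a-z_]+[[:space:]]+[a-z]+/')

echo "[:space:]   matched: $wrong"
echo "[[:space:]] matched: $right"
```

This assumes an awk with POSIX character class support; on an awk without it the second command may print nothing.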

Sun/Solaris

The native Sun/Solaris awk is notoriously bug-filled. If you are on that platform, try nawk or /usr/xpg4/bin/awk or /usr/xpg6/bin/awk.
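If you are not sure which awk on your machine understands character classes, a rough probe along these lines may help. This is only a sketch: the candidate list is just the names mentioned above, and the `report` variable and the test input are made up for illustration.

```shell
report=""
for candidate in awk nawk /usr/xpg4/bin/awk /usr/xpg6/bin/awk; do
  # Skip candidates that are not installed on this machine.
  command -v "$candidate" >/dev/null 2>&1 || continue

  # Exit status 0 only if "a b" matched a pattern using [[:space:]].
  if printf 'a b\n' |
     "$candidate" '/a[[:space:]]b/ { found = 1 } END { exit !found }' 2>/dev/null
  then
    report="$report$candidate: supports [[:space:]]
"
  else
    report="$report$candidate: no POSIX character classes
"
  fi
done
printf '%s' "$report"
```

Whichever candidate reports support is the one to use for the commands in this answer.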

Using sed

A very similar approach can be used with sed. This uses a regex based on yours:

$ sed -n '/^[a-zA-Z_*]\+[ \t]\+[a-zA-Z*]\+ *[(]/p' dict3.c
dictent_t* dictentcreate(const char * key, const char * val)  
dict_t* dictcreate() 
void dictdestroy(*dict_t d) 
void dictdump(dict_t *d) 
int dictlook(dict_t *d, const char * key) 
int dictget(char* s, dict_t *d, const char *key)
dict_t* dictadd(dict_t* d, const char * key, const char * val)
dict_t dictup(dict_t d, const char * key, const char *newval) 
dict_t* dictrm(dict_t* d, const char * key)

The -n option tells sed not to print unless we explicitly ask it to. The construct /.../p tells sed to print the line if the regex inside the slashes is matched.
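The effect of -n can be seen on a two-line input (a small sketch; the input lines are made up, and the pattern is deliberately simpler than the one above so that it works with any POSIX sed):

```shell
input=$(printf 'int dictlook(dict_t *d)\n  int i = 0;\n')

# With -n, only the lines selected by /.../p are printed.
with_n=$(printf '%s\n' "$input" | sed -n '/^int /p')

# Without -n, every line is printed once automatically and each matching
# line is printed a second time by p, giving three lines in total here.
without_n=$(printf '%s\n' "$input" | sed '/^int /p' | wc -l | tr -d ' ')

echo "$with_n"
echo "lines printed without -n: $without_n"
```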

All the improvements to the regex suggested by Ed Morton apply here also.

Using perl

The above can also be adapted to perl:

perl -ne  'print if /^[a-zA-Z_*]+[ \t]+[a-zA-Z*]+ *[(]/' dict3.c

edited Oct 16 '15 at 19:22

answered Oct 14 '15 at 20:38

John1024

I got there by using a _real_ space ...but I guess the correct is `[[:space:]]` – Pedro Lobito Oct 14 '15 at 20:39
If you are sure that your files have only blanks and __no__ tabs, then using a real blank is fine. If you are _not_ sure, then use `[[:blank:]]` or `[[:space:]]`. The latter two are also unicode-safe. – John1024 Oct 14 '15 at 20:53
Nicely done; terminology quibble: I suggest not using 'blank' to mean 'spaces', given that `[[:blank:]]` and the C library function `isblank()` use 'blank' to mean either a space _or_ a tab. – mklement0 Oct 14 '15 at 21:49
@mklement0 Yes, you are entirely correct: ASCII `\x20` is a "space" while `[:blank:]` means either space or tab. However, `[:space:]` means any of space, tab, newline, carriage return, vertical tab, or form feed. If I was going to design a didactically-sound naming system, this wouldn't be it. – John1024 Oct 14 '15 at 22:22
@John1024: Good point: `[:space:]` is poorly named; should have been `[:whitespace:]`, perhaps. In prose, the trinity of 'whitespace' (all whitespace), 'blanks' (spaces and/or tabs) and 'spaces' (`\x20` only) makes sense to me. – mklement0 Oct 14 '15 at 22:28
Thank you, and sorry for my stupid mistake. It appears that [[:space:]] doesn't work on my system; no problem, I've replaced it with " +" because I always use a single space as a separator (like a normal person?). But it only works until I introduce the parenthesis. I introduce it with \(; when I tried not escaping it I got a syntax error. Why does my parenthesis make it fail, and how can I make it work? – Nicolas Scotto Di Perto Oct 15 '15 at 05:03
@NicolasScottoDiPerto What is your OS? What version of awk? – John1024 Oct 15 '15 at 05:07
OS version: Debian Jessie (8), awk version: GNU Awk 4.1.1 – Nicolas Scotto Di Perto Oct 15 '15 at 05:11
@NicolasScottoDiPerto That is exactly what I am using. GNU awk has long supported `[[:space:]]` and it does not need to have `(` escaped. If either of those issues persist, let me see exactly what command you are running and exactly what output/error messages are produced. – John1024 Oct 15 '15 at 05:17
This matches the return type, the function name and possibly a trailing space: awk '/^[a-zA-Z*_]+[ \t]+[a-zA-Z*_] */{ print $0";" }' dict3.c and it works. But this: awk '/^[a-zA-Z*_]+[ \t]+[a-zA-Z*_] *\(/{ print $0";" }' dict3.c doesn't catch anything, and this: awk '/^[a-zA-Z*_]+[ \t]+[a-zA-Z*_] *(/{ print $0";" }' dict3.c produces a syntax error. – Nicolas Scotto Di Perto Oct 15 '15 at 05:51
Try: `awk '/^[a-zA-Z_*]+[ \t]+[a-zA-Z*]+ *[(]/{ print $0";" }' dict3.c` – John1024 Oct 15 '15 at 05:58
Ooooops! Really sorry guys, I forgot I was on a remote system over ssh! So the OS is SunOS 5.10! I've tried nawk with the same pattern but I still get the same result... – Nicolas Scotto Di Perto Oct 15 '15 at 06:02
@NicolasScottoDiPerto to get an almost-POSIX awk with, among other things, support for character classes like `[[:space:]]` use /usr/xpg4/bin/awk on Solaris, not nawk and definitely not old, broken awk (/usr/bin/awk). Despite its name of "New Awk", nawk is actually a very old awk with somewhat limited functionality. Lesson there - never use the word "new" when naming your software! – Ed Morton Oct 15 '15 at 11:57
OK, thank you, I really appreciate the information you've given. So the problem comes from the version of awk, and I can't choose another version since I'm running it at my college. The best part is that it has led me to learn perl! Which, in addition to being the best runtime for regex, as far as I've read, should be more portable! – Nicolas Scotto Di Perto Oct 15 '15 at 18:57
@NicolasScottoDiPerto If you are stuck with buggy versions of awk, you may also want to consider `sed`. `sed` is simpler than `perl` and handles this problem well. I added sample `sed` and `perl` code to the answer. – John1024 Oct 15 '15 at 19:41
It doesn't work with sed either; I guess regexes aren't for Solaris commands... It works well in Perl! Perl's my new hero, thank you! – Nicolas Scotto Di Perto Oct 15 '15 at 20:00
@NicolasScottoDiPerto you are simply picking up the wrong version of awk and sed. You are WAY off track and about to waste a huge amount of your time if you think it's reasonable to learn perl for trivial text manipulation like this - just use a current version of the standard UNIX tools. You have been at a disadvantage since Solaris ships with extremely old versions of sed and awk as their default. – Ed Morton Oct 16 '15 at 04:17
Why wouldn't perl be worth learning? I'm only going to learn it "superficially"; I'm not going deep into it, I just want to know the minimum needed to use it as a tool for text manipulation in a comprehensive manner. To do that I think it's better to have an overall view of the language. – Nicolas Scotto Di Perto Oct 16 '15 at 05:16
Because you already have some knowledge of awk and awk is available on all UNIX boxes and is an outstanding tool for text manipulation. What you're suggesting is like saying "I'm going to stop learning C and instead learn C++ but not delve deeply into it, just learn enough to use it for the same basic functionality I can do in C". It's pointless since at that level it's just more of the same and you'll learn the new syntax but not the paradigm so you'll end up missing the point, not learning/understanding the idioms, and do the equivalent of procedural programming in C++. – Ed Morton Oct 16 '15 at 05:47

score 2 · Answer 2 · edited May 23 '17 at 12:14

The regexp you're trying to write would be:

$ awk '/^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/' file
dictent_t* dictentcreate(const char * key, const char * val)
dict_t* dictcreate()
void dictdestroy(*dict_t d)
void dictdump(dict_t *d)
int dictlook(dict_t *d, const char * key)
int dictget(char* s, dict_t *d, const char *key)
dict_t* dictadd(dict_t* d, const char * key, const char * val)
dict_t dictup(dict_t d, const char * key, const char *newval)
dict_t* dictrm(dict_t* d, const char * key)

which written without character classes and making assumptions about your locale would be:

$ awk '/^[a-zA-Z_][a-zA-Z0-9_]*\**[ \t]+[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\([^)]*\)/' file
dictent_t* dictentcreate(const char * key, const char * val)
dict_t* dictcreate()
void dictdestroy(*dict_t d)
void dictdump(dict_t *d)
int dictlook(dict_t *d, const char * key)
int dictget(char* s, dict_t *d, const char *key)
dict_t* dictadd(dict_t* d, const char * key, const char * val)
dict_t dictup(dict_t d, const char * key, const char *newval)
dict_t* dictrm(dict_t* d, const char * key)

but:

Get/use an awk that has character classes because if it doesn't have that then who knows what else it's missing?
It's always trivial to write a script to find the strings you want but MUCH harder to NOT find the strings you DON'T want. For example, the above will match text inside comments and would fail given a declaration like int foo(int x /* always > 0 (I hope) */). When providing sample input/output you should always include some text that you think will be hard for a script to NOT select given it "looks" a lot like the text you do want to select but in the wrong context for your needs.
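Both failure modes are easy to reproduce. In this sketch the comment text is made up for illustration, and `match()` with `substr()` is used only to show exactly which part of the line the regex consumed:

```shell
# A line inside a block comment that merely looks like "type name(args)".
false_positive=$(printf '%s\n' '/*' 'returns dictlook(d, key) unchanged' '*/' |
  awk '/^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/')

# A real declaration with a ")" inside a comment: [^)]* stops at the first ")",
# so the matched text ends in the middle of the comment.
truncated=$(printf '%s\n' 'int foo(int x /* always > 0 (I hope) */)' |
  awk 'match($0, /^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/) {
         print substr($0, RSTART, RLENGTH)
       }')

echo "selected from a comment: $false_positive"
echo "matched part only:       $truncated"
```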

Note that C symbols cannot start with a number and so the regexp to match one is not [[:alnum:]_]+ but is instead [[:alpha:]_][[:alnum:]_]*. Also functions can and often do return pointers to pointers to pointers and the * can be next to the function name instead of the function return type so you REALLY should be using a regexp like this (untested since you didn't provide input of the format that this would match) if your function declarations can be any of the normal formats:

awk '/^[[:alpha:]_][[:alnum:]_]*([[:space:]]*(\*[[:space:]]*)+|[[:space:]]+)[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/' file

That won't of course match declarations that span lines - that is a whole other can of worms.
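A regex of this kind can be checked against the usual single-line pointer layouts before trusting it. The following is a sketch under assumptions: the separator group is one possible way of writing "stars and/or whitespace between the return type and the name", and the function name `name` and all input lines are hypothetical.

```shell
matched=$(printf '%s\n' \
    'char *name(void)' \
    'char* name(void)' \
    'char **name(void)' \
    'char name(void)' \
    'charname(void)' \
    '  int i = f(x);' |
  awk '/^[[:alpha:]_][[:alnum:]_]*([[:space:]]*(\*[[:space:]]*)+|[[:space:]]+)[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/')

# The first four layouts should be selected. "charname(void)" has no separator
# between type and name, and the indented statement does not start at column 0.
printf '%s\n' "$matched"
```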

In general you can't parse C without a C parser but if you want something cheap and cheerful then at least run a C beautifier on the code first to try to get all the various possible layouts into one consistent format (google "C beautifier"), and you also need to strip out the comments (see for example https://stackoverflow.com/a/13062682/1745001).

Given your new requirements and your new sample input/output, this is what you are asking for:

$ awk 'match($0,/^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/) { print substr($0,RSTART,RLENGTH) ";" }' file
dict_t dictup(dict_t d, const char * key, const char * newval);
dict_t* dictrm(dict_t* d, const char * key);

but again - this is by no means robust given the possible layouts of C code in general. You need a C parser, a C beautifier, and/or a specialized tool to do this job (e.g. google cscope) robustly.
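To make the difference between printing the matching line and printing the matched string concrete, here is a self-contained sketch. It assumes a cut-down copy of the two definitions from the question (bodies shortened to one statement) held in a shell variable instead of dict3.c:

```shell
src=$(printf '%s\n' \
    'dict_t dictup(dict_t d, const char * key, const char * newval)' \
    '{' \
    '  return d;' \
    '}' \
    'dict_t* dictrm(dict_t* d, const char * key) {' \
    '  return d;' \
    '}')

# /regexp/ { print $0 ";" } prints the whole LINE, so the trailing " {" survives.
whole_lines=$(printf '%s\n' "$src" |
  awk '/^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/ { print $0 ";" }')

# match() sets RSTART and RLENGTH, so substr() returns only the matched STRING.
protos=$(printf '%s\n' "$src" |
  awk 'match($0, /^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/) {
         print substr($0, RSTART, RLENGTH) ";"
       }')

printf 'whole lines:\n%s\n\nmatched text only:\n%s\n' "$whole_lines" "$protos"
```

Only the second form drops the `{` that shares a line with the dictrm definition.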

answered Oct 15 '15 at 12:08

Ed Morton

It's a good idea, I should try it, definitely. In perl now ! ^^ – Nicolas Scotto Di Perto Oct 15 '15 at 19:20
Much of the good advice that Ed provides here (+1) about regular expressions will transfer to perl. – John1024 Oct 15 '15 at 19:30
@NicolasScottoDiPerto Why, oh why, would you want to use perl for text manipulation? Scripts too readable in awk? Too portable? Also your subject says "Awk..." and you tagged your question with awk, not perl - specifically asking us to help you come up with an awk solution and then saying "In perl now" is annoying at best. – Ed Morton Oct 15 '15 at 20:16
Yeah, sorry, but it was more about the aim of extracting function declarations than about awk. I said in a comment that I can't get access to another version of awk at my college, and since I mainly want to extract text there, I cannot use awk to do this, so... – Nicolas Scotto Di Perto Oct 16 '15 at 05:29
@NicolasScottoDiPerto I already told you that you CAN access another version of awk, /usr/xpg4/bin/awk, and even without that you could still use nawk without character classes. I'm actually surprised you have perl installed there - awk comes as standard with ALL UNIX installations but perl doesn't (I usually don't have it on the machines I use at work). – Ed Morton Oct 16 '15 at 05:41
Sorry that I didn't try it, you were right it works :) Ok so how can I add the ; at the end of each line ? – Nicolas Scotto Di Perto Oct 16 '15 at 18:47
I'm not sure what you mean (you should edit your question to show the sample input and expected output related to this) but if you're asking how to specify that the string matching a regexp must end in `;` then you just add `;$` to the end of the regexp and depending on your needs you might want to make it `.*;$` (`$` is the regexp `end of string` delimiter just like `^` is the `start of string` delimiter). – Ed Morton Oct 16 '15 at 18:52
And thank you as well ! Also, I only need the regex, not more. Because in the file there's a declaration with the { on the same line and it is printed by awk whereas it's not intended by the regex. – Nicolas Scotto Di Perto Oct 16 '15 at 18:54
Again, I don't know what that means. If you have test cases where the provided script doesn't do what you want you should edit your question to update the sample input and expected output to include examples of those cases. – Ed Morton Oct 16 '15 at 18:55
I'm catching the function definitions in my C file to automatically generate the appropriate header with all the prototypes. The thing is that in the C source they are declarations that don't end with the character ;, and in the header they should. So I'm matching a regex, and after each match I want to add the character ;. Does that make sense? As for the { thing, it prints dict_t* dictrm(dict_t* d, const char * key) { which is the full line; the other lines don't contain the { (it's on the next line), so I guess it has printed the full line for each match. – Nicolas Scotto Di Perto Oct 16 '15 at 18:58
I think I have an idea what you're getting at but why not just edit your question to show EXACTLY what you're getting at? I think you're mixing up the terms `declaration` and `definition`, btw, so that's not helping. I now THINK that instead of wanting to extract the declarations of functions as you said, you're really trying to extract the definitions of functions to create declarations from. Again - just show us. – Ed Morton Oct 16 '15 at 19:00
OK, and what about cases where the return type is on a different line from the function name? How about when arguments are spread across lines? etc, etc, Don't you need to handle those too? – Ed Morton Oct 16 '15 at 20:02
The question is not about the regex but more about how awk works. I don't understand why awk catches more than the regex, since the regex should end at a ). And I'd like to know how to append a ; to the end of each match. I will learn awk. – Nicolas Scotto Di Perto Oct 17 '15 at 06:30
When you execute `grep "regexp" file` or `sed -n '/regexp/p' file` or `awk '/regexp/' file` you are saying "print the **line** from `file` that `regexp` occurs on". You are NOT saying "print the **string** from `file` that matches the `regexp`". `awk '/regexp/' file` is shorthand for `awk '/regexp/{print $0}' file` so If you want to append a string to the line awk is outputting, just make it explicit `awk '/regexp/{print $0 ";"}' file`. – Ed Morton Oct 17 '15 at 12:53
I edited my answer to show at the end how to do what I now think you are asking for. Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins. Also, get GNU awk as without it you are missing a TON of extremely useful functionality. – Ed Morton Oct 17 '15 at 13:10
Nice, thank you for the suggestion, I'll take a look at it – Nicolas Scotto Di Perto Oct 18 '15 at 07:31
