AWK: Using a variable as part of a regular expression

Question

I have a text file like the following, containing a username, a description, and two date ranges in German date format:

User###@###Description###@###1. August - 8. August 2016###@###1. September - 7. September 2016

The fields are separated by the ###@### delimiter. I would like to check whether a given field (e.g. $3) contains the same month name twice. If it does, the first occurrence of the month name should be removed, so that the output of awk is:

User###@###Description###@###1. - 8. August 2016###@###1. - 7. September 2016

Then I got the idea to put a for loop into the awk part of my bash script, which increments i in order to read the month name from a predefined variable. Here is a more detailed look:

script.sh:

m1=January; m2=February; m3=March; m4=April; m5=May; m6=June; m7=July; m8=August; m9=September; m10=October; m11=November; m12=December


    awk -F '###@###' '
    {for (i=1;i++;i<=12){ 
    count=0;
    $3 ~ 'm'i {count++};
    if (count == 2){gsub(mi,"" ,$3)}
    }}' Info.txt > Info.tmp

Unfortunately, awk is unable to look up the variable name mi (i.e. m1, m2, m3, etc.).

What do I have to change so that I can use a variable as part of a pattern and then act on a match?

Here is how you can pass a bash variable to awk: http://stackoverflow.com/questions/19075671/how-to-use-shell-variables-in-awk-script – anand, Sep 07 '16 at 19:41
It would be better to use an array of months and then check whether the month exists or not. `'m' i` does not seem to be a good way. – anand, Sep 07 '16 at 19:45
@anand The 'i' of mi corresponds to the month number, and the variable holds the month name. I was aware of the solution you linked and tried it previously, but it does not explain how to handle variables whose names are built using 'i'. "'mi'", 'm'i and /mi/ were only a few of the things I tried. I already had "-v m1=m1 m2=m2 ..." as an option, but I removed it for this example in case I did something wrong. – Otaku Kyon, Sep 07 '16 at 20:03
@EdMorton The special thing in this case is that there is a for loop where "i" is part of the bash variable name. So it is not a duplicate. – Otaku Kyon, Sep 07 '16 at 20:14
a) Don't do that (see [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](http://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice)) and b) It doesn't matter, it's exactly the same problem whether `i` is a variable or a constant and whether or not it's set in a loop. Again - google it. – Ed Morton, Sep 07 '16 at 20:36
Why use 'mi' together? Rather, create an array of months and use m[i]. – anand, Sep 08 '16 at 06:17

cxw · Accepted Answer · 2016-09-08T19:23:53.163

2

You can put the predefined names in the awk script. Something like this, maybe. (Quick hack - just about to log off for the day ;) )

awk -F ... ' BEGIN { m[1]="January"; m[2]="February"; ... } 
            {for(i=1...
             if ( $3 ~ m[i] ) { count++ }
             ...}'
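
For anyone who wants to fill in the dots: below is one way the sketch above could be completed. It is only a sketch under a few assumptions of mine that are not in the original post: the helper name `collapse_months`, the use of `split()` to fill the array, looping over every field from $3 onward (the sample line has date ranges in both $3 and $4), and English month names as in the sample. The parenthesised concatenation `(m[i] ".*" m[i])` builds the regex from the array element at run time.

```shell
#!/bin/sh
# collapse_months is a hypothetical helper name (not from the original post).
# It reads ###@###-delimited records on stdin and drops the first of two
# identical month names inside each date-range field.
collapse_months() {
    awk -F '###@###' -v OFS='###@###' '
    BEGIN {
        # split() fills m[1]..m[12] and returns the number of elements.
        n = split("January February March April May June July August September October November December", m, " ")
    }
    {
        # Assumption: every field from 3 onward holds a date range.
        for (f = 3; f <= NF; f++) {
            for (i = 1; i <= n; i++) {
                # A string built by concatenation acts as a dynamic regex.
                if ($f ~ (m[i] ".*" m[i])) {
                    # Remove the first occurrence and the space after it.
                    sub(m[i] " ", "", $f)
                }
            }
        }
        print
    }'
}

printf '%s\n' 'User###@###Description###@###1. August - 8. August 2016###@###1. September - 7. September 2016' | collapse_months
# prints: User###@###Description###@###1. - 8. August 2016###@###1. - 7. September 2016
```

`OFS` is set to the same delimiter because awk rebuilds the record with `OFS` as soon as `sub()` changes a field; without it the output would be joined with single spaces.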

Edit: For the benefit of future readers, here's the text from the OP's shortText.com link below:

awk -F '###@###' ' BEGIN{m1="Januar"; m2="February"; m3="March"; m4=April; m5=May; m6=June; m7=July; m8=August; m9=September; m10=October; m11=November; m12=December} {for (i in m){ count=0; $3 ~ (m[i] ".*" m[i]) {print ++count}; if (count == 1){sub(m[i],"" ,$3)} }}' Info.txt > Info.tmp

edited Sep 08 '16 at 19:23

answered Sep 07 '16 at 19:46

cxw


No, that's wrong in a couple of different ways.. – Ed Morton Sep 07 '16 at 20:38
@cxw I tried your solution, but you simplified the script quite a bit. Your code checks whether $3 contains m[i]. What I actually need to know is whether there are >two< identical m[i]'s, so that the first month name can be deleted from the text file. I think your approach might work, but m[i] would have to be replaced by a regex like /m[i]*m[i]/ .... This is my current problem: I do not know how to use a variable in a regular expression. Do you? – Otaku Kyon Sep 08 '16 at 17:36
@OtakuKyon I see your point. I voted to reopen this question, but in case others don't agree, I recommend opening a separate question that just has the `awk` part and not the `bash` part. Something like "how to collapse repeated text within a field in awk?" **But for now,** for variables, `$1 ~ (m[0] ".*" m[0])` will test if the field 1 has two occurrences of `m[0]` in it. You can use a string as a regular expression. – cxw Sep 08 '16 at 17:48
I put your regular expression into my script, but awk reports a syntax error. When I put it into an if statement, nothing happens. I then changed `count++` to `print ++count` so that I could check whether the counter gets incremented, but it doesn't. It seems this regex is not working. (EDIT: I wrote `$3 ~ (m[i] ".*" m[i])` so that the regex fits my text file.) You can check the current script here: http://shortText.com/7ee3bb33 – Otaku Kyon Sep 08 '16 at 18:28
@OtakuKyon The array initialization looks like the problem - all 12 should be like `m[1]="Januar"`, with square brackets (`m[1]`, not `m1`) and double-quotes (`"Januar"`, not `Januar`). If you are still having trouble, please post another question, since the comments are not long enough for effective debugging :) . – cxw Sep 08 '16 at 18:42
Thank you for finding my careless mistake. I finally got the script to run and it is working. Of course, I will give you both a check mark and an upvote. :) – Otaku Kyon Sep 08 '16 at 18:50
@OtakuKyon Great news! Thanks very much! :) – cxw Sep 08 '16 at 19:21
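
To make the point from the comments concrete: in awk, `/m[i]/` is a regex literal and never looks at a variable, while a string (or a concatenation of strings) on the right-hand side of `~` is used as a dynamic regex. A minimal sketch, assuming a hypothetical helper `has_twice` (my name, not from the thread) that receives the word from the shell through `-v`:

```shell
#!/bin/sh
# has_twice is a hypothetical helper: for each line on stdin it reports
# whether the word given as the first argument occurs at least twice.
has_twice() {
    awk -v word="$1" '
    {
        # /word.*word/ would search for the literal text "word".
        # (word ".*" word) concatenates the value of the variable instead.
        print (($0 ~ (word ".*" word)) ? "twice" : "not twice")
    }'
}

echo '1. August - 8. August 2016' | has_twice August   # prints: twice
echo '31. July - 8. August 2016'  | has_twice August   # prints: not twice
```

Note that the value is treated as a regular expression, so characters such as `.` or `*` in the word keep their regex meaning; plain month names are not affected by this.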
