0

I have the following awk script:

#! /usr/bin/awk -f

BEGIN{
FS=OFS="\t";

split("Mtm|^Dept|^Im|^TT|ge[mn]$,^Mini", arr1, ",")
split("Variable Expenses,Fixed Expenses", arr2, ",")
itm="Accounting & Legal";
}

$3<0 && $4~arr1[1] {$7=itm; $8=arr2[1]; print $0}
$3<0 && $4~arr1[2] {$7=itm; $8=arr2[2]; print $0}

Which produces the following output:

c_wp_e_2025 2021-05-10  -120.00 8430Ken Jebsen  BILL PAYMENT    12:08-11662 Accounting & Legal  Variable Expenses
c_wp_e_2148 2021-07-27  -90.85  6450Barrie L. Jorgen    BILL PAYMENT    09:31-35651 Accounting & Legal  Variable Expenses
e_wp_e_789  2018-06-12  -190.00 Immigration NZ 11   DEBIT       Accounting & Legal  Variable Expenses
e_wp_e_818  2018-06-29  -153.43 Dept of Home Affa 26    DEBIT   AUD 141.37  Accounting & Legal  Variable Expenses
e_wp_e_839  2018-07-18  -67.47  TT Services New Z 16    DEBIT       Accounting & Legal  Variable Expenses
e_wp_e_2512 2021-07-27  -470.20 Dept Internal Aff 23    DEBIT       Accounting & Legal  Variable Expenses
c_az_262    2021-10-01  -210.45 Mtm Accounting Ltd. Mlm Accounting  Accounting & Legal  Variable Expenses
c_az_421    2021-07-19  -561.20 Paypal *Mtm     Paypal *Mtm Accounting & Legal  Variable Expenses
c_az_1082   2020-11-25  -52.20  Ministryofbusiness  285515Dpsa2A    Mbie    Accounting & Legal  Fixed Expenses
c_az_1110   2020-11-16  -2228.70    Mtm Accounting Ltd. Mlm Accounting  Accounting & Legal  Variable Expenses

This works as desired. I have however many more categories besides itm="Accounting & Legal" (e.g. itm="Bank_charges) and sometimes there are Fixed Expenses and Variable Expenses. At other times one of them only in which case the arrays created in the splits would only have one value. In such a case the second statement in the main body would become unused. For every new category there are different search patterns. I am new to writing awk programs and am stuck on how to approach this. How could this be written efficiently to accommodate the above scenario.

The input (excerpt) would be:

c_wp_e_2025     2021-05-10      -120.00 8430Ken Jebsen    BILL PAYMENT    12:08-11662
c_wp_e_2148     2021-07-27      -90.85  6450Barrie L. Jorgen    BILL PAYMENT    09:31-35651
e_wp_e_789      2018-06-12      -190.00 Immigration NZ 11       DEBIT
e_wp_e_818      2018-06-29      -153.43 Dept of Home Affa 26    DEBIT   AUD 141.37
e_wp_e_839      2018-07-18      -67.47  TT Services New Z 16    DEBIT
e_wp_e_2512     2021-07-27      -470.20 Dept Internal Aff 23    DEBIT
c_az_262        2021-10-01      -210.45 Mtm Accounting Ltd.     Mlm     Accounting
c_az_421        2021-07-19      -561.20 Paypal *Mtm             Paypal *Mtm
c_az_1082       2020-11-25      -52.20  Ministryofbusiness      285515Dpsa2A    Mbie
c_az_1110       2020-11-16      -2228.70        Mtm Accounting Ltd.     Mlm     Accounting

Criteria for Bank_charges e.g. would be:

split("^Cle|Forei|2 WBC|irnie W| \
wtown W|*M|^Repl|elex|^Unar|lert$,enance$|ebit|A/C|rice$", arr1, ",")
split("Variable Expenses,Fixed Expenses", arr2,",")
itm="Bank_charges

Some output from this criteria would result in:

c_az_1668   2020-03-06  -10.00  Visa Debit Card Fee 4825561******   4823    Bank_charges    Fixed Expenses
c_az_1687   2020-02-28  -8.50   Monthly A/C Fee         Bank_charges    Fixed Expenses
c_az_1688   2020-02-28  -2.50   Clearance Fee           Bank_charges    Variable Expenses
c_az_1785   2020-01-31  -8.50   Monthly A/C Fee         Bank_charges    Fixed Expenses

Another category that only has Variable expenses is:

split("^318|^74 |ASB|^City S|^Fix", arr1,",")
split("Variable Expenses", arr2,",")
itm="Bank_withdrawals"

This however does not work at all with the above action. All data from the input file is categorized Bank_withdrawals and classified as Variable expenses.

joanis
  • 10,635
  • 14
  • 30
  • 40
Christian Hick
  • 401
  • 3
  • 10
  • I assume you want to categorize the records based on the keywords in the 4th field. What is the dependency among the contents of `arr1`, `arr2` and `itm`? How will the `arr1` and `arr2` change if you introduce other categories such as `itm="Bank_charges"`? What is your expected output then? – tshiono Jan 24 '22 at 03:56
  • arr1 will be different for every category specified in itm. arr1 searches all those transactions that meet the specified search pattern. arr2 can only be either Fixed Expenses, Variable Expenses or both depending on the category. E.g. Utility expenses will probably all be Fixed expenses. The end goal is to have all transaction grouped into categories and classified as to whether they are Fixed or Variable Expenses. – Christian Hick Jan 24 '22 at 04:06
  • Thank you for the prompt feedback. Your answer is close to my expectation. Then would you please show an example of `arr1` which is related to `Bank_charges`? It does not have to be a full set. Just a few key words and the relevant input line(s) will be good enough. – tshiono Jan 24 '22 at 04:13
  • Thanks tshiono. I have edited my post to supply itm="Bank_charges" . – Christian Hick Jan 24 '22 at 04:31
  • I don't see the mapping between the sample input and the expected output. Where did `Visa Debit Card Fee 4825561******` or `Monthly A/C Fee` come from? Also your cod excerpts seem to be messed up with newlines and extraneous characters/strings throughout. You also posted some actual output, then the corresponding sample input but not the expected output for THAT input. Please [edit] your question to provide just 1 [mcve] with concise, testable sample input and expected output plus your existing code attempt to do that transformation. – Ed Morton Jan 24 '22 at 12:25

3 Answers3

1

Would you please try the following:

#!/usr/bin/awk -f

BEGIN {
    FS=OFS="\t";

    # enumerate the item, keyword and the statement in order
    itm = "Accounting & Legal"
    key[itm] = "Mtm|^Dept|^Im|^TT|ge[mn]$,^Mini"
    stmnt[itm] = "Variable Expenses,Fixed Expenses"

    itm = "Bank_charges"
    key[itm] = "^Cle|Forei|2 WBC|irnie W|wtown W|*M|^Repl|elex|^Unar|lert$,enance$|ebit|A/C|rice$"
    stmnt[itm] = "Variable Expenses,Fixed Expenses"

    itm="Bank_withdrawals"
    key[itm] = "^318|^74 |ASB|^City S|^Fix"
    stmnt[itm] = "Variable Expenses"

    # initialize key2 and stmnt2 using itm, key and stmnt defined above
    for (itm in key) {
        split(key[itm], a, ",")
        key2[itm,1] = a[1]
        key2[itm,2] = a[2]

        split(stmnt[itm], a, ",")
        stmnt2[itm,1] = a[1]
        stmnt2[itm,2] = a[2]
    }
}

# main loop with the input lines
{
    for (itm in key) {
        for (i = 1; i <= 2; i++) {
            if (key2[itm,i] != "") {
                if ($3 < 0 && $4 ~ key2[itm,i]) {
                    $7 = itm; $8 = stmnt2[itm,i]
                    print $0
                }
            }
        }
    }
}

If you have more items, append them with the keys and statements below the "Bank_charges" lines referring to the existing codes. If an item has either statement "Variable Expenses" or "Fixed Expenses", just put one element without a comma.

tshiono
  • 21,248
  • 2
  • 14
  • 22
  • Thanks tshiono. The script you have written works but when I append `itm="Bank_withdrawals" key[itm] = "^318|^74 |ASB|^City S|^Fix" stmnt[itm] = "Variable Expenses"` under Bank_charges without comma it does not work. – Christian Hick Jan 24 '22 at 09:06
  • Thank you for the feedback. By does not work, do you mean the same lines appear twice? If so, my consideration for the case w/o comma was not enough. Now it is fixed. Could you please test it? Sorry for the inconvenience. – tshiono Jan 24 '22 at 11:05
1

I think this is what you're trying to do but it's obviously untested since there's isn't a concrete sample of input and expected output in your question to test with:

$ cat tst.sh
#!/usr/bin/env bash

awk '
    BEGIN{
        FS=OFS="\t"

        mkMap("Mtm|^Dept|^Im|^TT|ge[mn]$", "Variable Expenses", "Accounting & Legal")
        mkMap("^Mini", "Fixed Expenses", "Accounting & Legal")
        mkMap("^Cle|Forei|2 WBC|irnie W| wtown W|[*]M|^Repl|elex|^Unar|lert$", "Variable Expenses", "Bank_charges")
        mkMap("enance$|ebit|A/C|rice$", "Fixed Expenses", "Bank_charges")
        mkMap("^318|^74 |ASB|^City S|^Fix", "Variable Expenses", "Bank_withdrawals")
    }

    $3 < 0 {
        found = 0
        for ( re in re2type ) {
            if ( $4 ~ re ) {
                $7 = re2item[re]
                $8 = re2type[re]
                found = 1
            }
        }
        if ( found ) {
            print
        }
    }
    function mkMap(re,type,item) {
        re2type[re] = type
        re2item[re] = item
    }
' "${@:--}"
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • Thanks Ed, this is spot on. Does writing this as a bash script rather than an awk script (#! /usr/bin/awk -f) not slow possessing down? – Christian Hick Jan 24 '22 at 19:07
  • @ChristianHick you're welcome. Not in any meaningful/noticeable way and you should never use a shebang to call awk for the reasons discussed at https://stackoverflow.com/a/61002754/1745001. – Ed Morton Jan 24 '22 at 19:13
0

If you're new to awk, it might be best to look at writing this long hand:

#! /usr/bin/awk -f

BEGIN{ FS=OFS="\t"; }

$3<0 && $4~"Mtm|^Dept|^Im|^TT|ge[mn]$" {$7=;"Accounting & Legal"; $8="Variable Expenses"; print $0}
$3<0 && $4~"^Mini" {$7="Accounting & Legal"; $8="Fixed Expenses"; print $0}

(Please excuse typos, writing on my phone)

However I don't believe this is what you're actually asking for, so I'd suggest looping through the elements of the arrays that have been created in the begin like so:

#! /usr/bin/awk -f

BEGIN{
  FS=OFS="\t";

  els=split("Mtm|^Dept|^Im|^TT|ge[mn]$,^Mini", arr1, ",");
  split("Variable Expenses,Fixed Expenses", arr2, ",");
  split("Accounting & Legal","Accounting & Legal", arr3, ",");
}
for (i=1;i<=els;i++){
  $3<0 && $4~arr1[i] {$7=arr3[i]; $8=arr2[i]; print $0}
}
Jad
  • 1,257
  • 12
  • 19
  • Thanks Jad. I tried your suggestion but keep getting syntax errors: `awk: ./script.awk:30: for (i=1;i<=els;i++){ awk: ./script.awk:30: ^ syntax error awk: ./script.awk:30: for (i=1;i<=els;i++){ awk: ./script.awk:30: ^ syntax error awk: ./script.awk:30: for (i=1;i<=els;i++){ awk: ./script.awk:30: ^ syntax error awk: ./script.awk:30: for (i=1;i<=els;i++){ awk: ./script.awk:30: ^ syntax error awk: ./script.awk:31: $3<0 && $4~arr1[i] {$7=arr3[i]; $8=arr2[i]; print $0} ^ syntax error ` – Christian Hick Jan 24 '22 at 01:05
  • @ChristianHick those syntax errors are complaining about line 30. The script in the answer doesn't have 30 lines so whatever it is you're trying to execute it's not the script in the answer. – Ed Morton Jan 24 '22 at 12:19