1

I am calling the tr command from awk in my shell script to mask data. With the example file below, only the first line of the file is changed. When I read the file in a while loop and call the awk command inside the loop it works, but it takes a very long time to complete. I need to mask several columns [example: $1, $5, $9] in the same file (file.txt), the change must apply to every line rather than only the first, and it should run as fast as possible. Please advise.

cat file.txt
========
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek,lskjsjshsh
abcbchs,degehek
abcbchs,degehek,lskjsjshsh

Command and actual output

awk -F"," -v OFS=","  '{ "echo \""$1"\" | tr \"a-c\" \"e-f\" | tr \"0-5\" \"6-9\"" | getline $1 }7' file.txt

effffhs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek,lskjsjshsh
abcbchs,degehek
abcbchs,degehek,lskjsjshsh

Expected output

effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek,lskjsjshsh
effffhs,degehek
effffhs,degehek,lskjsjshsh

user2449709
  • 2
    You are trying to run bash code in awk, but awk is a completely separate language from bash. If you want to run bash code on each line, use a [while read loop](http://mywiki.wooledge.org/BashFAQ/001) instead. – that other guy Mar 24 '15 at 19:00
  • 1
    It looks vaguely like you copy/pasted code from [here](http://stackoverflow.com/questions/21766541/how-to-translate-a-column-value-in-the-file-using-awk-with-tr-command-in-unix) but that's really not an idiomatic or common way to do it. – tripleee Mar 24 '15 at 19:03
  • @thatotherguy A **while** loop reads the file line by line. I tried it and it works, and I got the expected result, but the run time is the problem. I have to mask data in different kinds of files with different delimiters, and the columns vary from file to file, so I hope awk is the best way to do it. If we can get the expected result on the awk command line, I will call it from my ksh shell script. – user2449709 Mar 24 '15 at 19:19

2 Answers

4

The code you found runs an external shell command pipeline for each input line. As you discovered, that's an awfully inefficient way to do what you are asking. Awk isn't really an ideal choice for this task at all. Maybe try Perl.

perl -F, -lane 'for (0, 4, 8) { $F[$_] =~ tr/a-c/e-f/; $F[$_] =~ tr/0-5/6-9/ } print join(",", @F)' file

The -F, option sets the field separator as in Awk, but Perl doesn't split the input line automatically. With -a it does, splitting each line into an array named @F, and with -n it loops over all input lines. The -l option is a convenience that removes the newline from each input line and adds one back when you print.

Notice how the columns are numbered from zero, not one, like in Awk; so the indices in the for loop access the first, fifth, and ninth elements of @F.

tripleee
  • Based on the tangential comment from the OP, it attempts to modify multiple columns. As a side effect, it adds new empty columns if there are fewer than nine. Try with input data with more columns, or change the indices to only manipulate columns you actually have. – tripleee Mar 24 '15 at 21:02
  • @user2449709 Although you commented on Ed Morton's answer, let's continue the discussion here. If you cannot get this script to work, I need more information in order to diagnose what's wrong. It works for me. Feel free to play around with the demo at http://ideone.com/IvGUJW – tripleee Mar 25 '15 at 18:11
3

You forgot to close() the command after every invocation. Here's the correct way to write it:

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    cmd="echo '" $1 "' | tr 'a-c' 'e-f' | tr '0-5' '6-9'"
    $1 = ( (cmd | getline line) > 0 ? line : $1 )
    close(cmd)
    print
}

$ awk -f tst.awk file
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek,lskjsjshsh
effffhs,degehek
effffhs,degehek,lskjsjshsh

You also didn't protect yourself from getline failures, hence the extra complexity around the getline call; see http://awk.info/?tip/getline.
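
The effect of the missing close() can be seen in a small sketch (the function names are illustrative only). In the question every $1 is identical, so every record builds the same command string. Awk keeps that pipe open, the first getline reads its single line of output, and every later getline hits end of file and returns 0, which leaves $1 unchanged:

```shell
# Print the getline return value for each input line.
without_close() {
    awk '{ cmd = "echo X"; r = (cmd | getline v); print r }'
}
with_close() {
    awk '{ cmd = "echo X"; r = (cmd | getline v); close(cmd); print r }'
}

printf 'a\nb\nc\n' | without_close   # prints 1, then 0, then 0
printf 'a\nb\nc\n' | with_close      # prints 1 three times
```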

Given your comments, this shows how to modify multiple fields (1, 3, and 5 in this case) simultaneously:

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    cmd = "echo '" $0 "' | tr 'a-c' 'e-f' | tr '0-5' '6-9'"
    new = ( (cmd | getline line) > 0 ? line : $1 )
    close(cmd)
    split(new,tmp)
    for (i in tmp) {
        if (i ~ /^(1|3|5)$/) {
            $i = tmp[i]
        }
    }
    print
}

$ cat file
abc,abc,abc,abc,abc
abc,abc,abc,abc,abc,abc,abc
abc,abc,abc,abc,abc,abc
abc,abc,abc,abc

$ awk -f tst.awk file
eff,abc,eff,abc,eff
eff,abc,eff,abc,eff,abc,abc
eff,abc,eff,abc,eff,abc
eff,abc,eff,abc

To handle quotes in the input data:

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    gsub(/'/,SUBSEP)
    cmd = "echo '" $0 "' | tr 'a-c' 'e-f' | tr '0-5' '6-9'"
    new = ( (cmd | getline line) > 0 ? line : $1 )
    close(cmd)
    split(new,tmp)
    for (i in tmp) {
        if (i ~ /^(1|3|5)$/) {
            $i = tmp[i]
        }
    }
    gsub(SUBSEP,"'")
    print
}

$ cat file
a'c,abc,a"c,abc,abc
abc,a'c,abc,a"c,abc,abc,abc
abc,abc,abc,abc,abc,abc
abc,abc,abc,abc

$ awk -f tst.awk file
e'f,abc,e"f,abc,eff
eff,a'c,eff,a"c,eff,abc,abc
eff,abc,eff,abc,eff,abc
eff,abc,eff,abc
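
Another option, offered only as a sketch, is to escape the value for the shell instead of swapping quotes out and back. Inside single quotes /bin/sh treats every character literally except the single quote itself, so replacing each `'` with `'\''` should also cover backquotes and `$`. The helper `shquote` and the wrapper `mask_first` are hypothetical names, `printf` replaces `echo` because some shells' `echo` interprets backslashes, and the expected mapping assumes a `tr` that pads the shorter set the way GNU tr does:

```shell
# Mask field 1 via tr, quoting the value so the shell sees it literally.
mask_first() {
    awk '
    BEGIN { FS = OFS = "," }
    # Hypothetical helper: wrap s in single quotes for /bin/sh.
    function shquote(s,    n, parts, i, out) {
        n = split(s, parts, "\047")               # \047 is a single quote
        out = parts[1]
        for (i = 2; i <= n; i++)
            out = out "\047\\\047\047" parts[i]   # quote, backslash, quote, quote
        return "\047" out "\047"
    }
    {
        cmd = "printf \"%s\\n\" " shquote($1) " | tr a-c e-f | tr 0-5 6-9"
        if ((cmd | getline line) > 0)
            $1 = line
        close(cmd)
        print
    }'
}
```

Usage would be `mask_first < file`. It still starts a pipeline per record, so it addresses quoting only, not speed.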

If you don't have any particular control char that's guaranteed not to appear in your input, you can create a non-existent string to use instead of SUBSEP above by using the technique described at the end of https://stackoverflow.com/a/29237745/1745001
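
If starting a shell pipeline for every record is still too slow, the translation can be done inside awk itself, which also sidesteps the quoting problem because no shell is involved. The following is a minimal sketch, not a drop-in replacement: `mask_fields` and its `cols` variable are illustrative names, and the `from`/`to` table is an assumption that mirrors what GNU tr does with these sets (a to e, b and c to f, 0 to 6, 1 to 7, 2 to 8, 3 through 5 to 9).

```shell
# Mask the listed fields without spawning any external command.
# Usage: mask_fields 1,3,5 < file
mask_fields() {
    awk -v cols="$1" '
    BEGIN {
        FS = OFS = ","
        from = "abc012345"      # assumed mapping, see the text above
        to   = "eff678999"
        for (i = 1; i <= length(from); i++)
            map[substr(from, i, 1)] = substr(to, i, 1)
        n = split(cols, col, ",")
    }
    # Translate s one character at a time through the lookup table.
    function mask(s,    out, i, c) {
        out = ""
        for (i = 1; i <= length(s); i++) {
            c = substr(s, i, 1)
            out = out ((c in map) ? map[c] : c)
        }
        return out
    }
    {
        for (k = 1; k <= n; k++)
            if (col[k] <= NF)   # skip fields this record does not have
                $(col[k]) = mask($(col[k]))
        print
    }'
}
```

Because fields beyond NF are skipped, short records are not padded with empty columns. For a different delimiter, FS and OFS would need to be passed in as well.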

Ed Morton
  • 1
    Although good advice, this does not appear to address the efficiency concerns raised by the OP. – tripleee Mar 24 '15 at 21:16
  • The OP didn't mention any efficiency concerns in his question other than saying calling the awk command in a shell loop took a long time and I strongly suspect that is because he is doing something goofy in his shell loop like calling awk on his whole file once for each line of his file but he didn't show us that so idk. He can try this and if it's too slow we can take it from there... – Ed Morton Mar 24 '15 at 21:20
  • @Ed Morton I tried your script and it changes the first column across the whole file, so the output is as expected, but the run time is still the problem. Masking one column of a 240 MB file takes one hour, the same as with my own script. At one hour per column, 8 columns would take 8 hours. I hope there is a way to mask multiple columns in one pass and reduce the processing time. – user2449709 Mar 25 '15 at 17:39
  • @Ed Morton. My code ======= for COL in `echo "5,6,8" | sed 's/,/ /g'` do while read lne; do echo "$lne" | awk -F"$DEL" -v OFS="$DEL" '{ "echo \""$'$COL'"\" | tr \"a-c\" \"e-f\" | tr \"0-5\" \"6-9\"" | getline $'COL' }7' >> file_bak ; done < file mv file_bak file done – user2449709 Mar 25 '15 at 17:41
  • @user2449709 you would not call the awk script multiple times to manipulate multiple columns, instead you would just loop through the target columns inside the awk script. If you posted some sample input that actually HAS multiple columns to change and the associated output then we could help you more. – Ed Morton Mar 25 '15 at 18:23
  • @user2449709 I updated my script and added my own sample input/output to show how to modify multiple columns simultaneously. – Ed Morton Mar 25 '15 at 18:37
  • @EdMorton This really works, but I get an error when a record in the file contains a single quote '. I changed the line to echo \"'" $0 "'\" | tr \"a-c\" \"e-f\" | tr \"0-5\" "\6-9\" and then it worked, but after that change I get another error when the file contains a backquote or an acute accent, whereas your previous solution works fine there. Can you please tell me a generic way to handle all these special characters when reading the whole line of the file? Apart from this, I will let you know the processing time with your code. I believe it will work. – user2449709 Mar 26 '15 at 10:17
  • I don't believe there is a 100% robust general way to handle all possible "special characters" because their meaning is context sensitive but to just handle single quotes - leave the script as I wrote wrt quoting but surround the main part of it with code to replace and later restore each `'` with a control char, e.g. `gsub(/'/,SUBSEP) ... gsub(SUBSEP,"'")`. That should work as long as the awk SUBSEP char doesn't appear in your input. I updated my answer to show that at the end. – Ed Morton Mar 26 '15 at 15:53
  • 1
    @Ed Morton Thanks for your help and ideas. It reduced the time for files with multiple columns to change. – user2449709 Mar 30 '15 at 14:42
  • You're welcome, remember to click on the check mark next to whichever answer you end up accepting, if any. – Ed Morton Mar 30 '15 at 15:17