Using awk in a wrong approach

Question

I was told that I used awk the wrong way in the code below, but I can't see how to improve it so that it is simpler to read.

read -r bookName
read -r authorName

if grep -iqx "$bookName:$authorName" cutText.txt
then
    lineNum=`awk -v bookName="$bookName" -v authorName="$authorName" '$0 ~ bookName ":" authorName {print NR} BEGIN{IGNORECASE=1}' BookDB.txt`

    echo "Enter a new title"
    read -r newTitle

    awk -F":" -v bookName="$bookName" -v newTitle="$newTitle" -v lineNum="$lineNum" 'NR==lineNum{gsub(bookName, newTitle)}1' cutText.txt > temp2.txt
    mv -f temp2.txt cutText.txt
else
    echo "Error"
fi

My cutText.txt contains content as shown below:

Hairy Potter:Rihanna
MARY IS A LITTLE LAMB:Kenny
Sing along:May

This program updates a title in cutText.txt. If a user wants to change MARY IS A LITTLE LAMB to Mary is not a lamb, they enter the new title and the original title in cutText.txt is replaced with Mary is not a lamb.

A problem arises if a user enters "Mary is a little lamb" for $newTitle: this code doesn't work, because it takes case into account. It only works if the user types "MARY IS A LITTLE LAMB". I have learned that BEGIN{IGNORECASE=1} is gawk-specific, so it cannot be used in other awk implementations.

How can I script this better so I can ignore case in user input? Thank you!

You probably meant _A problem arises now that if a user enter "Mary is a little lamb" for $bookName_. Anyway, this is an `awk`-only question. You should probably remove the `bash` and `shell` tags. And you should probably simplify your question to _How to tell a non-GNU awk to ignore case in patterns?_, with a small example of the behaviour you have and the behaviour you want. Indicating which version of `awk` you are using would be a plus. – Renaud Pacalet, Aug 13 '16 at 07:20
Let's get it working robustly first and then worry about "simpler to read" later ;-). Your current code will fail in various ways on partial matches, regexp metacharacters, escape characters, backreferences, colons, etc, in the book title or author name, and will erase your database if an error occurs in the awk script. — Ed Morton, Aug 13 '16 at 15:04

score 1 · Answer 1 · answered Aug 13 '16 at 09:58

To get you started, create the following files:

r.awk

function asplit(str, arr, sep,   temp, i, n) {  # make an assoc array from str
    n = split(str, temp, sep)
    for (i = 1; i <= n; i++)
        arr[temp[i]]++
    return n
}

function regexpify(s,   back, quote, rest, all, meta, n, i, c, u, l, ans) {
    # Build a case-insensitive regexp that matches the literal string s.
    back = "\\"; quote = "\"";
    rest = "^$.[]|()*+?"
    all  = back quote rest
    asplit(all, meta, "")

    n = length(s)
    for (i = 1; i <= n; i++) {
        c = substr(s, i, 1)
        if      (c in meta)                             # escape metacharacters
            ans = ans back c
        else if ((u = toupper(c)) != (l = tolower(c)))  # letter: match either case
            ans = ans "[" l u "]"
        else
            ans = ans c
    }

    return ans
}

BEGIN {
    old = regexpify(old)
    sep = ":"; m = length(sep)
}

NR == n {
    i = index($0, sep)
    fst = substr($0,   1, i-1)  # everything before the separator
    scn = substr($0, i+m     )

    gsub(old, new, fst)
    print fst sep scn

    next
}

{
    print
}

cutText.txt

Hairy Potter:Rihanna
MARY IS A LITTLE LAMB:Kenny
Sing along:May

Usage:

awk -v n=2 -v old="MArY iS A LIttLE lAmb" -v new="Mary is not a lamb" -f r.awk  cutText.txt

Expected output:

Hairy Potter:Rihanna
Mary is not a lamb:Kenny
Sing along:May

That is immensely over-complicated for the task at hand, and it will fail when the old title contains `:`, when the new title contains "&", in partial-match situations when put in context in the OP's shell script, etc. Whenever you find yourself trying to escape all regexp metacharacters in a variable to make your code behave as if it were a string, stop and think about that, and then just use string operations instead of regexp operations to avoid all that complexity. – Ed Morton, Aug 13 '16 at 14:11

score 1 · Accepted Answer · edited May 23 '17 at 12:01

This uses exact string matching and so cannot fail on partial matches or if your old title contains : or regexp metacharacters or if the new title contains backreferences (e.g. &) or if a backslash (\) appears in any field or any of the other situations that your other scripts to date will fail on:

$ cat tst.sh
read -r oldTitle
read -r authorName

echo "Enter a new title"
read -r newTitle

awk '
BEGIN {
    ot=ARGV[1]; nt=ARGV[2]; an=ARGV[3]
    ARGV[1] = ARGV[2] = ARGV[3] = ""
}
tolower($0) == tolower(ot":"an) {
     $0 = nt":"an
     found = 1
}
{ print }
END {
    if ( !found ) {
        print "Error" | "cat>&2"
    }
}
' "$oldTitle" "$newTitle" "$authorName" cutText.txt > temp2.txt &&
mv -f temp2.txt cutText.txt

$ cat cutText.txt
Hairy Potter:Rihanna
MARY IS A LITTLE LAMB:Kenny
Sing along:May

$ ./tst.sh
mary is a little lamb
kenny
Enter a new title
Mary is not a lamb

$ cat cutText.txt
Hairy Potter:Rihanna
Mary is not a lamb:kenny
Sing along:May

I'm populating the awk variables from ARGV[] because if I populated them using -v var=val or var=val in the arg list then any backslashes would be interpreted and so \t, for example, would become a literal tab character. See the shell FAQ article I wrote about that a long time ago - http://cfajohnson.com/shell/cus-faq-2.html#Q24.
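
A minimal sketch of that difference, assuming a POSIX awk; the title `C:\temp` and the variable names `via_v` and `via_argv` are hypothetical and only serve the illustration:

```shell
#!/bin/sh
# Hypothetical title containing a backslash followed by "t".
title='C:\temp'

# Passed with -v, awk processes escape sequences, so \t becomes one tab.
via_v=$(awk -v s="$title" 'BEGIN { print length(s) }')

# Passed through ARGV, the string arrives untouched.
via_argv=$(awk 'BEGIN { s = ARGV[1]; ARGV[1] = ""; print length(s) }' "$title")

printf '%s %s\n' "$via_v" "$via_argv"   # prints: 6 7
```

The -v version is one character shorter because the two characters `\t` were collapsed into a single tab.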

I changed bookName to oldTitle, btw just because that seems to make more sense in relation to newTitle. No functional difference.

When doing any text manipulation it's extremely important to understand the differences between strings and the various regexp flavors (BREs/EREs/PCREs) and between partial and full matches.

grep operates on BREs by default, on EREs given the -E arg, on PCREs given the -P arg, and on strings given the -F arg.
sed operates on BREs by default, on EREs given the -E arg. sed does not support PCREs. sed also cannot operate on strings, and making your regexps behave as if they were strings is painful; see the Stack Overflow question "Is it possible to escape regex metacharacters reliably with sed".
awk operates on both EREs and strings by default. You just use EREs with regexp operators and strings with string operators (see https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions).

So if, as in your case, you need all characters in your text treated literally, then that is a string, not a regexp, and you should not be using sed on it. If you want to quickly find a string in a file and are happy with a partial match, you should use grep. If you want to do anything beyond that, such as change a string in a file or do an exact match, you should use awk.
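
To make the regexp/string and partial/full distinctions concrete, here is a rough sketch. The three-line catalog and the `count` helper are made up for this illustration; the helper counts matching lines using a regexp operator (`~`), a partial string match (`index()`), or a full string match (`==`):

```shell
#!/bin/sh
# Hypothetical three-line catalog, used only for this illustration.
db='The Dentists Guide To Jaws:Henchley McBoring
Jaws:Henchley
What? Why?:Smith'

# count MODE KEY: print how many catalog lines match KEY.
# KEY travels through ARGV so that backslashes are not interpreted.
count() {
    printf '%s\n' "$db" | awk '
        BEGIN { mode = ARGV[1]; key = ARGV[2]; ARGV[1] = ARGV[2] = "" }
        mode == "regexp"  && $0 ~ key       { n++ }   # ERE, partial match
        mode == "partial" && index($0, key) { n++ }   # string, partial match
        mode == "exact"   && $0 == key      { n++ }   # string, full match
        END { print n + 0 }
    ' "$1" "$2"
}

count regexp  'Jaws:Henchley'      # 2: also hits the Dentists line
count exact   'Jaws:Henchley'      # 1
count regexp  'What? Why?:Smith'   # 0: "?" is a regexp operator here
count partial 'What? Why?:Smith'   # 1
```

Only the `exact` mode avoids both the false match on the longer title and the surprise from the `?` characters.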

Wow thorough and easy to understand explanation! Appreciate it! Thanks for pointing out that my above code will also falsely match "Jaws:Henchley". I didn't even think of that. I learnt a lot from your comments and answer. I spent a good hour looking through your code and finally understood how you go about doing it. I can't thank you enough!! — JamesPoppycock, Aug 14 '16 at 08:43
You're welcome. You might want to consider using some character other than `:` for your title/author separator though since `:`s commonly appear in book titles. If I were you I'd use a tab character as the separator and convert all chains of white space in the title or authors name to a single blank char before inserting into the database, thereby ensuring that the only tab in each line is the separator. There's no reason I can think of that a tab character should appear in a book title or author name. That will make any further operations you want to do on the data much simpler. — Ed Morton, Aug 14 '16 at 15:42
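
One possible way to produce records in the tab-separated form suggested in the comment above. This is only a sketch: `add_record` is a hypothetical helper, and it squeezes runs of blanks and tabs but does not trim leading or trailing whitespace:

```shell
#!/bin/sh
# add_record TITLE AUTHOR (hypothetical helper): print "TITLE<tab>AUTHOR"
# with every run of blanks/tabs inside each field squeezed to one blank.
add_record() {
    awk 'BEGIN {
        for (i = 1; i <= 2; i++) {
            field[i] = ARGV[i]             # ARGV keeps backslashes intact
            gsub(/[ \t]+/, " ", field[i])  # squeeze whitespace runs
        }
        print field[1] "\t" field[2]
    }' "$1" "$2"
}

# The only tab left in the record is the separator, so -F'\t' splits it safely.
add_record 'Mary  is not   a lamb' 'Kenny' | awk -F'\t' '{ print $1 }'
```

With a tab as the separator, a title such as "Jaws: The Revenge" no longer collides with the field delimiter.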

score 0 · Answer 3 · answered Aug 13 '16 at 07:55

OK GUYS I JUST REALISED I AM DUMB AS ****

I was tearing my hair out for the whole day and all I had to do was to do this.

lineNum=`grep -in "$bookName:$authorName" BookDB.txt | cut -f1 -d":"`

sed -i "${lineNum}s/$bookName/$newTitle/I" BookDB.txt cutText.txt

Omg I feel like killing myself.

JamesPoppycock

No, that's the wrong approach and will fail with false matches (look up "Jaws:Henchley" when you have "The Dentists Guide To Jaws:Henchley McBoring" in your catalog), when the new title contains backreferences (try to replace any title with "War & Peace"), when any BRE metacharacters appear in bookName or authorName, and in other situations. The UNIX tool for manipulating text is awk. When you find yourself reaching for shell+grep+sed combinations, stop and pick up the awk book (Effective Awk Programming, 4th Edition, by Arnold Robbins) instead to figure out the right way to do it. – Ed Morton Aug 13 '16 at 14:52
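
The "War & Peace" case from the comment above can be reproduced with a short sketch. The sample line and the variable names `with_sed` and `with_awk` are illustrative assumptions, not part of the original scripts:

```shell
#!/bin/sh
line='Hairy Potter:Rihanna'

# sed treats "&" in the replacement as "the text that matched".
with_sed=$(printf '%s\n' "$line" | sed 's/Hairy Potter/War & Peace/')

# Plain awk string concatenation has no special characters.
with_awk=$(printf '%s\n' "$line" | awk '
    BEGIN { nt = ARGV[1]; ARGV[1] = "" }
    { i = index($0, ":"); print nt substr($0, i) }
' 'War & Peace')

printf '%s\n%s\n' "$with_sed" "$with_awk"
# War Hairy Potter Peace:Rihanna
# War & Peace:Rihanna
```

The sed version silently corrupts the title, while the string-based awk version stores it as typed.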
