-1

I have a script that iterates over a file containing domains (google.com, youtube.com, etc). The purpose of the script is to count how many times each domain appears in the 12th column of a tab-separated values file.

while read domain; do
    awk -F '\t' '$12 == '$domain'' data.txt | wc -l
done < domains.txt

However, awk seems to be interpreting the dots in the domains as special characters. The following error message is shown:

awk: syntax error at source line 1
context is
        $12 ~ >>>  google. <<< com
awk: bailing out at source line 1

I am a beginner in bash so any help would be greatly appreciated!

adahy
  • Please, post some sample data with the related expected output. Don't post them as comments, images, tables or links to off-site services but use text and include them to your original question. Thanks. – James Brown Jun 28 '22 at 12:09
  • @JamesBrown added wc -l. The tsv file has a lot of different outputs. But to clarify, the 12th column contains a bunch of domains separated by commas. An example of values in the 12th column is autodiscover.intelart.com.br, cpanel.intelart.com.br, equipegoogle.com.br, intelart.com.br, mail.equipegoogle.com.br, mail.intelart.com.br, webdisk.intelart.com.br, webmail.intelart.com.br, www.equipegoogle.com.br, www.intelart.com.br domains.google.com – adahy Jun 28 '22 at 12:13
  • you can't nest single-quotes inside single-quotes. awk is seeing bare expansion of `$domain` not a string or regex – jhnc Jun 28 '22 at 12:14
  • your code doesn't consider substring matches: if domain is `ab.co` it will match `xab.co`, `ab.com`, etc. If each column 12 is a single domain, just compare with `==` – jhnc Jun 28 '22 at 12:14
  • @jhnc I have got that covered. The list also contains subdomains (that I want to be counted), so that's why I am not using ==. The focus here is on the dot in the variable's value. Are you saying I can't use any variables in awk matching? – adahy Jun 28 '22 at 12:16
  • If you have fixed the single-quoting mistake in your code, show the new code that demonstrates the problem, not this broken version. – jhnc Jun 28 '22 at 12:20
  • quoting matters: `awk 'BEGIN{ a=3; print 'a'; print "a"}'` - the first `'a'` is probably not doing what you think it is. consider also: `awk 'BEGIN{ a=3; print '"a"'; print "'a'"}'` – jhnc Jun 28 '22 at 12:23
  • @jhnc The == or ~ issue is already accounted for. The single-quoting mistake is not. Do you know how to fix it? – adahy Jun 28 '22 at 12:25

4 Answers

3

When you write:

domain='google.com'
awk -F '\t' '$12 == '$domain'' data.txt

the $domain is outside of any quotes:

awk -F '\t' '$12 == '$domain'      ' data.txt
            <       >       <      >
            start   end     start  end

and so exposed to the shell for interpretation first and THEN it becomes part of the body of the awk script before awk sees it. So what awk sees is:

awk -F '\t' '$12 == google.com' data.txt

and google.com is not a valid symbol (e.g. variable or function) name nor string nor number. What you MEANT to do was:

awk -F '\t' '$12 == "'"$domain"'"' data.txt

so the shell would see "$domain" instead of just $domain (see https://mywiki.wooledge.org/Quotes for why that's important) and awk would finally see:

awk -F '\t' '$12 == "google.com"' data.txt

which is fine, because "google.com" is now a string rather than a symbol. However, you should never allow shell variables to expand to become part of an awk script, as there are other caveats, so what you should really have done is:

awk -F '\t' -v dom="$domain" '$12 == dom' data.txt

See How do I use shell variables in an awk script? for more information.
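As a minimal sketch of the `-v` approach, the following builds a small made-up data file (the temporary directory, the three sample rows and the `count` variable are all assumptions for illustration, not part of the question) and counts the rows whose 12th column equals the domain. The count is done inside awk, so no `wc -l` is needed.

```shell
#!/bin/sh
# Hypothetical sample data: 12 tab-separated columns, domain in column 12.
tmp=$(mktemp -d)
printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\tgoogle.com\n'  >  "$tmp/data.txt"
printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\tyoutube.com\n' >> "$tmp/data.txt"
printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\tgoogle.com\n'  >> "$tmp/data.txt"

domain='google.com'

# -v hands the value to awk as data; it never becomes awk source code,
# so the dot in the value is just an ordinary character in a string.
# "n + 0" makes awk print 0 rather than an empty line when nothing matches.
count=$(awk -F '\t' -v dom="$domain" '$12 == dom { n++ } END { print n + 0 }' "$tmp/data.txt")
echo "$count"

rm -rf "$tmp"
```

Note that awk processes backslash escape sequences in `-v` values. That does not matter for domain names, but it is worth knowing for other inputs.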

By the way, even after fixing the above problem do not do this:

while read domain; do
    awk -F '\t' -v dom="$domain" '$12 == dom' data.txt | wc -l
done < domains.txt

as it'll be immensely slow and contains insidious bugs (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice). Do something like this instead (untested):

awk -F'\t' '
    NR==FNR {
        cnt[$1] = 0
        next
    }
    $12 in cnt {
        cnt[$12]++
    }
    END {
        for ( dom in cnt ) {
            print dom, cnt[dom]
        }
    }
' domains.txt data.txt

That will be far more efficient, robust, and portable than calling awk inside a shell read loop.

See What are NR and FNR and what does "NR==FNR" imply? for how that awk script works. Get the book Effective AWK Programming, 5th Edition, by Arnold Robbins to learn awk.
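To make the two-file idiom concrete, here is a small sketch that runs the same logic against made-up input. The file contents, the `row` helper and the trailing `sort` are assumptions added for the demonstration; `sort` is there only because the order of `for ( dom in cnt )` is unspecified in awk.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Hypothetical list of domains to count, one per line.
printf 'google.com\nyoutube.com\nexample.org\n' > "$tmp/domains.txt"

# Hypothetical helper: emit one 12-column tab-separated row with $1 in column 12.
row() { printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\t%s\n' "$1"; }
{ row google.com; row youtube.com; row google.com; row other.net; } > "$tmp/data.txt"

result=$(awk -F'\t' '
    NR==FNR {              # true only while reading the first file (domains.txt)
        cnt[$1] = 0        # register each wanted domain with a zero count
        next
    }
    $12 in cnt {           # second file: column 12 is one of the wanted domains
        cnt[$12]++
    }
    END {
        for ( dom in cnt ) {
            print dom, cnt[dom]
        }
    }
' "$tmp/domains.txt" "$tmp/data.txt" | sort)

echo "$result"
rm -rf "$tmp"
```

With this input the script reports `example.org 0`, `google.com 2` and `youtube.com 1`. The row for `other.net` is ignored because that domain is not listed in `domains.txt`.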

Ed Morton
  • when adding the quotation marks I got awk: non-terminated string google.com... at source line 1 context is <<< $12 ~ >>> "google.com – adahy Jun 28 '22 at 12:52
  • Then you copy/pasted incorrectly. Show us the full script that is outputting that error message so we can help you debug it. – Ed Morton Jun 28 '22 at 12:53
  • I am also having difficulties understanding your alternative solution. Could you please elaborate on that approach? Again, I am a beginner in bash so I would really appreciate your help – adahy Jun 28 '22 at 12:54
  • No point discussing that til we debug your first problem from your comment above but I added a reference that explains it (it's an **extremely** common awk idiom). – Ed Morton Jun 28 '22 at 12:56
  • If you'd like help debugging [your current quotation marks problem](https://stackoverflow.com/questions/72786078/awk-match-by-variable-with-dot-in-it#comment128563793_72786501) then show us the script now before we move on to help someone else. – Ed Morton Jun 28 '22 at 13:06
  • The script is currently: while read domain; do awk -F '\t' '$12 ~ "'"$domain"'"' data.txt | wc -l done < domains.txt – adahy Jun 28 '22 at 13:12
  • That script will only produce the error message you say it does if `$domain` contains unexpected characters such as `"`s or maybe a backslash at the end. That's one of the reasons not to let a shell variable expand to become part of an awk script - cryptic errors due to input data. As mentioned in my answer - don't do that. – Ed Morton Jun 28 '22 at 13:16
  • Why are you using `$12 ~` (a partial regexp match) in the script in your comment when you're using `$12 ==` (a full string match) in your question? It's very unlikely that a partial regexp match would be the best solution for whatever you're trying to do, but the fact you think you need it means you may need a partial string match instead of a full string match, which would require a different solution than the one in my answer. – Ed Morton Jun 28 '22 at 13:20
  • meant to write ==, my bad. – adahy Jun 28 '22 at 13:21
0
awk -F '\t' '$12 == '$domain'' data.txt | wc -l

The single quotes are building an awk program. They are not something visible to awk. So awk sees this:

$12 == google.com

Since there aren't any quotes around google.com, that is a syntax error. You just need to add quotation marks.

awk -F '\t' '$12 == "'"$domain"'"'  data.txt

The quotes jammed together like that are a little confusing, but it's just this:

 '....'    stuff to send to awk. Single quotes are for the shell.
 '..."...' a double quote inside the awk program for awk to see
 '...'"..."  stuff in double quotes _outside_ the awk program for the shell

We can combine those like this:

 '..."'"$var"'"...'  

That's a bunch of literal awk code ending in a double quote, followed by the expansion of the shell parameter var, which is double-quoted as usual in the shell for safety, followed by more literal awk code starting with a double quote. So the end result is a string passed to awk that includes the value of var inside double quotes.
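One way to check what the shell actually hands to awk is to print the assembled program text instead of running it. This is only an illustrative sketch; the variable names `broken` and `fixed` are made up for the demonstration.

```shell
#!/bin/sh
domain='google.com'

# The original quoting: $domain sits between two single-quoted pieces,
# so awk receives a bare google.com with no quotes around it.
broken='$12 == '$domain''
printf '%s\n' "$broken"    # prints: $12 == google.com

# The corrected quoting: each single-quoted piece carries a literal
# double quote for awk, and "$domain" is expanded by the shell in between.
fixed='$12 == "'"$domain"'"'
printf '%s\n' "$fixed"     # prints: $12 == "google.com"
```

The second string is a valid awk pattern; the first is the text that triggers the syntax error.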

But you don't have to be so fancy or confusing since awk provides the -v option to set variables from the shell:

awk -v domain="$domain" '$12 == domain' data.txt

Since domain is not quoted inside the awk code, it is interpreted as the name of a variable. (Periods are not legal in variable names, which is why you got a syntax error when your domains were spliced in unquoted; had the value been a legal name, awk would have treated it as an uninitialized, empty variable and looked for lines whose twelfth field was likewise blank.)

Mark Reed
  • when adding the quotation marks I got awk: non-terminated string google.com... at source line 1 context is <<< $12 ~ >>> "google.com – adahy Jun 28 '22 at 12:43
  • First, you should use the `-v` version instead, because it's easier. Second, you must have missed a double-quote. There are four: the open double quote inside the single-quoted awk code, the open outside the single-quotes for the shell, the close still outside single quotes, and the close back inside single quotes for awk. – Mark Reed Jun 28 '22 at 15:12
0

Use a combination of cut to print the 12th column of the TAB-delimited file, sort and uniq to count the items:

cut -f12 data.txt | sort | uniq -c
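
That pipeline counts every distinct value in column 12, not only the domains of interest. As a hedged sketch of one possible extension (the `grep -Fxf` filter, the sample files and the `row` helper are assumptions, not part of this answer), the values can be restricted to the lines listed in domains.txt before counting. Domains that never occur are simply absent from the output.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Hypothetical inputs for the demonstration.
printf 'google.com\nyoutube.com\n' > "$tmp/domains.txt"
row() { printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\t%s\n' "$1"; }
{ row google.com; row other.net; row google.com; row youtube.com; } > "$tmp/data.txt"

# cut -f12    : keep only column 12 (TAB is cut's default delimiter)
# grep -Fxf   : keep lines that exactly equal a fixed string from domains.txt
# sort | uniq -c : count identical lines
# The final awk only normalizes the leading padding that uniq -c emits.
counts=$(cut -f12 "$tmp/data.txt" | grep -Fxf "$tmp/domains.txt" | sort | uniq -c | awk '{ print $1, $2 }')

echo "$counts"
rm -rf "$tmp"
```

For this input the result is `2 google.com` followed by `1 youtube.com`.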
Timur Shtatland
0

This should give the count of how many lines of the input have "google.com" in $12:

{m,g}awk -v __="${domain}" '
 BEGIN { _*=\
      (  _ ="\t[^\t]*")*gsub(".",(_)_,_)*sub(".","",_)*\
  gsub("[.:&=/-]","[&]",__)*sub("[[][^[]+$",__"\t?",_)*(\
  
 FS=_ } { _+=NF } END { print _-NR }'
RARE Kpop Manifesto