
Sorry if I'm opening a new question, but it's not related to my previous one: this time I need a bash command to analyze the output.

I have the output from a query stored in a file like this:

3277654321    333011123456789
3277654321    333015123456789
3277654321    333103123456789
3277654321    333201123456789
3291234567    333991123456789
3291234567    333991123456789
3291234567    333011123456789

What I need is a bash command that counts the occurrences of each field1 value and, within each one, the occurrences of each distinct 5-digit prefix of field2, reporting output like this:

3277654321=4;33301=2;33310=1;33320=1
3291234567=3;33399=2;33301=1

Thanks, Lucas.

Lucas Rey
  • This would not be too challenging to do with `awk`, and undoubtedly someone will provide the code-writing service eventually, even though SO is not intended to be a "please write my code for me" service. But it seems to me like it would be better to generate the report you want directly from your database, using a query which actually generates the final result rather than an intermediate list. – rici Jan 17 '15 at 17:36
  • If you are querying from a database, it might be easier to do it directly. – Amit Jan 17 '15 at 17:39
  • Generating a report like that direct from the database is all kinds of messy. As long as the data is sorted (which the database can do), then post-processing is not unreasonable. And the database can and should do a bit more work for you: it could reasonably generate the first field, the first 5 characters of the second field, and a count of the number of entries: `SELECT field1, SUBSTR(field2, 1, 5) AS field2, COUNT(*) AS number FROM TheTable GROUP BY field1, field2 ORDER BY field1, field2`. You then have less data being transferred over the wire, which helps a lot if the database is remote. – Jonathan Leffler Jan 17 '15 at 18:29
  • @JonathanLeffler: The following seems to work fine in sqlite3; for mysql, you'd need to change the comma to the word SEPARATOR in group_concat: `select field1||"="||SUM(count2)||";"||group_concat(field2||"="||count2,";") as fields FROM (select field1, SUBSTR(field2,1,5) AS field2, COUNT(*) as count2 from tmp GROUP BY field1, field2 ORDER BY field1, field2) GROUP BY field1 ORDER BY field1;`. It's not *that* messy, imho. – rici Jan 17 '15 at 20:32
  • @rici: Yup, as you show, with the non-standard `GROUP_CONCAT` aggregate (and with ORDER BY available in sub-queries, also non-standard), it is not too bad. (Non-standard, as in 'not part of ISO standard SQL', AFAIK.) – Jonathan Leffler Jan 17 '15 at 20:40
  • @JonathanLeffler: Sure, it's not standard, but it's widely available in some form or another. (http://www.vertabelo.com/blog/technical-articles/group-concat lists several, and Google found a number of solutions for SQL Server.) Anyway, almost no significant database application is restricted to ISO standard SQL, from what I've seen; it's very common to use the features available in your DB server, and why not? – rici Jan 17 '15 at 21:31
  • @rici: If your DBMS supports it, or can be made to support it, there's no major reason not to use it. If your DBMS doesn't support it, there's good reason not to use it. If you note, I've included your material in my answer (with what I believe is due credit given). Part of my reason for cautiousness is that the DBMS I worked with mainly don't support GROUP_CONCAT as standard; one of them can be augmented with a [user-defined aggregate](http://stackoverflow.com/questions/715350), and maybe the other can too, but that may be more unacceptable to some customers in some circumstances. YMMV. – Jonathan Leffler Jan 17 '15 at 21:35
  • rici, this is a very interesting approach. Anyway, I have a very large amount of data in the DB, which is why I decided to keep the query as simple as possible and then analyze the output using bash. However, I get the following error when I try to execute your query: ERROR 1248 (42000): Every derived table must have its own alias – Lucas Rey Jan 18 '15 at 07:21

2 Answers


Using awk on the original data

What you're seeking is a control-break report. For once, the Wikipedia entry isn't much help on the subject. The sample data is shown with the first field sorted and equal prefixes of the second field grouped together; this solution assumes the data arrives at least grouped like that (it is trivial to add a sort operation before the awk script if it isn't; OTOH, since the data comes from a database, the DBMS could perfectly well sort the data).
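If you do need to sort first, a pipeline such as this should do (a sketch, assuming whitespace-separated fields; data and awk.script are the files created below):

sort -k1,1 -k2,2 data | awk -f awk.script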

For testing purposes, I created a file awk.script containing:

{   f1 = $1
    f2 = substr($2, 1, 5)
    if (oldf1 != f1)
    {
        # Control break on field 1: print the completed group, if any
        if (oldf1 != 0)
        {
            summary = summary ";" oldf2 "=" f2_count
            printf("%s=%d%s\n", oldf1, f1_count, summary)
        }
        oldf1 = f1
        f1_count = 0
        oldf2 = f2
        f2_count = 0
        summary = ""
    }
    else if (oldf2 != f2)
    {
        # Control break on the 5-digit prefix of field 2
        summary = summary ";" oldf2 "=" f2_count
        oldf2 = f2
        f2_count = 0
    }
    f1_count++
    f2_count++
}
END {
    # Print the final group (unless the input was empty)
    if (oldf1 != 0)
    {
        summary = summary ";" oldf2 "=" f2_count
        printf("%s=%d%s\n", oldf1, f1_count, summary)
    }
}

I put the seven lines of sample data into a file called data, and then ran:

$ awk -f awk.script data
3277654321=4;33301=2;33310=1;33320=1
3291234567=3;33399=2;33301=1
$

Make the DBMS do more work

At the moment, the data is similar to the output from a query such as:

SELECT Field1, Field2
  FROM SomeTable
 ORDER BY Field1, Field2

The output could be made better for your report by having the DBMS generate the first field, the first 5 characters of the second field, and a count of the number of entries:

SELECT field1, SUBSTR(field2, 1, 5) AS field2, COUNT(*) AS number
  FROM SomeTable
 GROUP BY field1, field2
 ORDER BY field1, field2

You then have less data being transferred over the wire, which helps a lot if the database is remote. You also have a simpler report. The data file becomes (data2):

3277654321 33301 2
3277654321 33310 1
3277654321 33320 1
3291234567 33399 2
3291234567 33301 1
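If you want to test this stage without a database, a rough awk equivalent of that query is sketched below (the row order may differ from the listing above, but the second script only needs the field1 values grouped):

awk '{ count[$1 " " substr($2, 1, 5)]++ }
     END { for (key in count) print key, count[key] }' data | sort > data2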

The awk script becomes (awk.script2):

{   if (oldf1 != $1)
    {
        # Control break on field 1: print the completed group, if any
        if (oldf1 != 0)
            printf("%s=%d%s\n", oldf1, f1_count, summary)
        oldf1 = $1
        f1_count = 0
        summary = ""
    }
    # Accumulate this prefix=count pair and add its count to the field 1 total
    summary = summary ";" $2 "=" $3
    f1_count += $3
}
END {
    if (oldf1 != 0)
        printf("%s=%d%s\n", oldf1, f1_count, summary)
}

Sample run:

$ awk -f awk.script2 data2
3277654321=4;33301=2;33310=1;33320=1
3291234567=3;33399=2;33301=1
$

Make the DBMS do even more work

Depending on your DBMS and whether it supports GROUP_CONCAT and ORDER BY clauses in sub-queries, you can make the database generate the final report itself; as rici put it, "It's not that messy, IMHO".

The following seems to work fine in SQLite3; for MySQL, you'd need to change the comma to the word SEPARATOR in GROUP_CONCAT (and, as the error reported in the comments shows, give the derived table an alias):

SELECT field1 || "=" || SUM(count2) || ";" ||
           group_concat(field2 || "=" || count2, ";") AS fields
  FROM (SELECT field1, SUBSTR(field2, 1, 5) AS field2, COUNT(*) AS count2
          FROM tmp
         GROUP BY field1, field2
         ORDER BY field1, field2
       )
 GROUP BY field1
 ORDER BY field1
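An untested MySQL sketch of the same query might look like this: CONCAT replaces ||, which MySQL treats as logical OR by default; the derived table gets the alias MySQL requires; and the ordering moves into GROUP_CONCAT, since an ORDER BY inside a derived table is not reliable. The table name tmp is carried over from rici's comment above.

SELECT CONCAT(field1, '=', SUM(count2), ';',
              GROUP_CONCAT(CONCAT(field2, '=', count2)
                           ORDER BY field2 SEPARATOR ';')) AS fields
  FROM (SELECT field1, SUBSTR(field2, 1, 5) AS field2, COUNT(*) AS count2
          FROM tmp
         GROUP BY field1, field2
       ) AS grouped
 GROUP BY field1
 ORDER BY field1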

Note that neither GROUP_CONCAT nor ORDER BY clauses in sub-queries is defined by ISO standard SQL, as far as I know, so not all DBMS will support these features. (The ORDER BY feature was omitted for a reason, but the reasoning didn't include considerations of 'orthogonality'.)

If the DBMS produces the data in the format you need, there's no need for an awk script to post-process it. What's best will ultimately depend on what else you're doing. Generally, use the DBMS to do calculations where it makes sense. IMO, don't use the DBMS for all the formatting; I expect report generation with pagination etc. to be done outside the DBMS proper. But if it can be persuaded to generate the data you need, by all means make it do the work.

Jonathan Leffler
  • Awesome Jonathan, the problem with the query is that I need the complete field2 to report back to another log file. However, the approach is cool, but how can I implement it in code without using an external script file? – Lucas Rey Jan 17 '15 at 19:45
  • If you need the complete field2 somewhere else, then you'll need to use the first option. You can avoid the separate script file by enclosing the script in single quotes in place of '`-f awk.script`' on the command line (see the sketch after these comments). If you like (I don't, but you may have different views on the subject), you can flatten the whole program into a single line as long as you add a suitable sprinkling of semicolons. I _emphatically_ don't recommend that, though. You need the code to be readable. – Jonathan Leffler Jan 17 '15 at 19:51
  • you can merge your two scripts into one. For example, your first script could be `gawk '{arr[$1][substr($2,1,5)]++}END{for (i in arr) for(j in arr[i]) print i, j, arr[i][j]}' groupTest` (note this needs gawk 4+ for arrays of arrays). You can build on top of that to get one single script. – qqibrow Jan 17 '15 at 21:01
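For what it's worth, here is a sketch of Jonathan's suggestion applied to the second script: the same program run inline against data2, with no separate script file:

awk '{ if (oldf1 != $1) {
           if (oldf1 != 0) printf("%s=%d%s\n", oldf1, f1_count, summary)
           oldf1 = $1; f1_count = 0; summary = ""
       }
       summary = summary ";" $2 "=" $3
       f1_count += $3
     }
     END { if (oldf1 != 0) printf("%s=%d%s\n", oldf1, f1_count, summary) }' data2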

People, I would like to share an "elegant" solution. Thanks go to the other community users who guided me by suggesting some of the steps.

awk '{ C1[$1]++                        # count rows per field1
       C2[$1, substr($2, 1, 5)]++     # count rows per field1 and 5-digit prefix of field2
     }
     END { for (c2 in C2) {
               split(c2, cx, SUBSEP)
               print cx[1] "=" C1[cx[1]] ";" cx[2] "=" C2[c2]
           }
     }' SUBSEP=";" out.txt |
sort |
awk '$1 != L { printf "%s%s", LT, $1; L = $1; LT = "\n" }  # new field1 group starts a new line
             { printf ";%s", $2 }                          # append this prefix=count pair
     END     { printf "\n" }' FS=";"

3277654321=4;33301=2;33310=1;33320=1
3291234567=3;33301=1;33399=2

(Note that the sort in the middle of the pipeline puts the field2 prefixes in ascending order within each line, rather than in order of first appearance.)

And rici, this is not a case of asking someone to write code for me. This is a very small part of a big script, so I just asked for help with one little thing. I'm interested in different approaches, which is why I asked without providing any code examples. Thanks to all the SO users participating in this question; I'm still open to trying different approaches.

Lucas Rey