Using awk on the original data
What you're seeking is a control-break report. For once, the Wikipedia entry isn't much help on the subject. The sample data is shown sorted, so this solution assumes that the data is sorted (it is trivial to add a sort operation before the awk script if it isn't; OTOH, since the data comes from a database, the DBMS could perfectly well sort the data).
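If the data weren't already sorted, the pre-sort would look something like this (a minimal sketch, assuming plain lexical ordering is appropriate for both fields):
$ sort -k1,1 -k2,2 data | awk -f awk.script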
For testing purposes, I created a file awk.script containing:
{   f1 = $1
    f2 = substr($2, 1, 5)
    if (oldf1 != f1)
    {
        # Level-1 control break: flush the completed group for field 1
        if (oldf1 != 0)
        {
            summary = summary ";" oldf2 "=" f2_count
            printf("%s=%d%s\n", oldf1, f1_count, summary)
        }
        oldf1 = f1
        f1_count = 0
        oldf2 = f2
        f2_count = 0
        summary = ""
    }
    else if (oldf2 != f2)
    {
        # Level-2 control break: record the sub-count for the old field 2
        summary = summary ";" oldf2 "=" f2_count
        oldf2 = f2
        f2_count = 0
    }
    f1_count++
    f2_count++
}
END {
    # Flush the final group (oldf1 == 0 means no data was read)
    if (oldf1 != 0)
    {
        summary = summary ";" oldf2 "=" f2_count
        printf("%s=%d%s\n", oldf1, f1_count, summary)
    }
}
I put the seven lines of sample data into a file called data, and then ran:
$ awk -f awk.script data
3277654321=4;33301=2;33310=1;33320=1
3291234567=3;33399=2;33301=1
$
Make the DBMS do more work
At the moment, the data is similar to the output from a query such as:
SELECT Field1, Field2
FROM SomeTable
ORDER BY Field1, Field2
The output could be made better for your report by having the DBMS generate the first field, the first 5 characters of the second field, and a count of the number of entries:
SELECT field1, SUBSTR(field2, 1, 5) AS field2, COUNT(*) AS number
FROM SomeTable
GROUP BY field1, field2
ORDER BY field1, field2
You then have less data being transferred over the wire, which helps a lot if the database is remote. You also have a simpler report. The data file becomes (data2):
3277654321 33301 2
3277654321 33310 1
3277654321 33320 1
3291234567 33399 2
3291234567 33301 1
The awk script becomes (awk.script2):
{   if (oldf1 != $1)
    {
        # Control break: flush the completed group for field 1
        if (oldf1 != 0)
            printf("%s=%d%s\n", oldf1, f1_count, summary)
        oldf1 = $1
        f1_count = 0
        summary = ""
    }
    # The DBMS has already counted each (field1, field2) pair
    summary = summary ";" $2 "=" $3
    f1_count += $3
}
END {
    # Flush the final group
    if (oldf1 != 0)
        printf("%s=%d%s\n", oldf1, f1_count, summary)
}
Sample run:
$ awk -f awk.script2 data2
3277654321=4;33301=2;33310=1;33320=1
3291234567=3;33399=2;33301=1
$
Make the DBMS do even more work
Depending on your DBMS and whether it supports GROUP_CONCAT and ORDER BY clauses in sub-queries, you can push the whole job into SQL; as rici suggested, "It's not that messy, IMHO". The following seems to work fine in SQLite3; for MySQL, you'd need to change the comma to the word SEPARATOR in GROUP_CONCAT:
SELECT field1 || '=' || SUM(count2) || ';' ||
       GROUP_CONCAT(field2 || '=' || count2, ';') AS fields
  FROM (SELECT field1, SUBSTR(field2, 1, 5) AS field2, COUNT(*) AS count2
          FROM tmp
         GROUP BY field1, field2
         ORDER BY field1, field2
       )
 GROUP BY field1
 ORDER BY field1
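For MySQL, a sketch of the equivalent (untested; note that in its default SQL mode MySQL treats || as logical OR rather than concatenation, so CONCAT() is used instead; it also requires an alias on the derived table, and it lets you put the ordering inside GROUP_CONCAT, which is more reliable than ordering the sub-query):
SELECT CONCAT(field1, '=', SUM(count2), ';',
              GROUP_CONCAT(CONCAT(field2, '=', count2)
                           ORDER BY field2 SEPARATOR ';')) AS fields
  FROM (SELECT field1, SUBSTR(field2, 1, 5) AS field2, COUNT(*) AS count2
          FROM tmp
         GROUP BY field1, field2
       ) AS t
 GROUP BY field1
 ORDER BY field1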
Note that neither GROUP_CONCAT nor ORDER BY clauses in sub-queries are defined by ISO standard SQL, as far as I know, so not all DBMSs will support these features. (The ORDER BY feature was omitted for a reason, but the reasoning didn't include considerations of 'orthogonality'.)
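If standards-based portability matters, SQL:2016 did eventually define an aggregate for this, LISTAGG, which some DBMSs (Oracle and DB2, among others) support; a sketch of the same query using it (untested):
SELECT field1 || '=' || SUM(count2) || ';' ||
       LISTAGG(field2 || '=' || count2, ';')
           WITHIN GROUP (ORDER BY field2) AS fields
  FROM (SELECT field1, SUBSTR(field2, 1, 5) AS field2, COUNT(*) AS count2
          FROM tmp
         GROUP BY field1, field2
       ) t
 GROUP BY field1
 ORDER BY field1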
If the DBMS produces the data in the format you need, there's no need for an awk script to post-process it. What's best will ultimately depend on what else you're doing. Generally, use the DBMS to do calculations where it makes sense. IMO, don't use the DBMS for all formatting (I expect report generation with pagination etc. to be done outside the DBMS proper), but if it can be persuaded to generate the data you need, by all means make it do the work.