There is no need for multiple processes and pipes. awk alone is more than capable of handling the entire job (and will often be considerably faster on large files). With awk, simply append each of the fields 2-NF to a string and use that string as an index to sum the numbers in field 1 in an array. Then, in the END section, output the contents of the array. For example, presuming your data is stored in file, you could do:
awk '{
    for (i=2; i<=NF; i++)
        str = str " " $i
    a[str] += $1
    str=""
}
END {
    for (i in a) print a[i], i
}' file
Above, the first for loop simply appends all fields from 2-NF to str, and a[str] += $1 sums the values in field 1 into array a using str as an index. That ensures the values for lines with identical text are summed. str is then reset to the empty string for the next record. In the END section, you simply loop over each element of the array, outputting the element value (the sum) and then the index (the original str for fields 2-NF).
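One side effect of the loop above is that str always begins with a space, so print a[i], i puts two spaces between the sum and the text. If that matters to you, a minimal sketch of a variant follows; it adds the separator only between fields. The function name sum_by_text and the two sample lines are made up for illustration and are not taken from your data.

```shell
# Hypothetical wrapper: sums field 1, grouped by the text in fields 2-NF.
sum_by_text() {
    awk '{
        str = $2                      # start the key with field 2 (no leading space)
        for (i = 3; i <= NF; i++)
            str = str " " $i          # add a separator only between fields
        a[str] += $1                  # sum field 1 under that key
    }
    END {
        for (i in a) print a[i], i    # traversal order of for-in is unspecified
    }' "$@"
}

# Made-up sample input: two records sharing the same text
printf '10 take a test\n20 take a test\n' | sum_by_text
# prints: 30 take a test
```

Because str is assigned from $2 on every record, no explicit reset is needed in this form.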
Example Use/Output
Just select the command above and middle-mouse paste it into a command line in the directory where your file is located (change the name file to your data file name):
$ awk '{
>     for (i=2; i<=NF; i++)
>         str = str " " $i
>     a[str] += $1
>     str=""
> }
> END {
>     for (i in a) print a[i], i
> }' file
30 take a test
37 cup of coffee
75 sign on the dotted
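If you want the lines sorted in a different order, just add | sort [options] after the filename to pipe the output to sort. (awk does not guarantee the order in which for (i in a) visits the elements, so the unsorted order can differ between awk implementations.) For example, for output in the order you show, you would use | sort -k 2 and the output would be: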
37 cup of coffee
75 sign on the dotted
30 take a test
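sort can also order by the sum rather than by the text. As a rough sketch, -k1,1n restricts the key to the first field and compares it numerically, and adding r reverses the order so the largest sum comes first. The function name by_sum_desc is made up for illustration; the input lines are the unsorted awk output shown above.

```shell
# Hypothetical helper: order "sum text" lines by the sum, largest first.
by_sum_desc() {
    sort -k1,1nr "$@"    # key is field 1 only; n = numeric, r = reversed
}

printf '30 take a test\n37 cup of coffee\n75 sign on the dotted\n' | by_sum_desc
# prints:
# 75 sign on the dotted
# 37 cup of coffee
# 30 take a test
```

In practice you would simply append | sort -k1,1nr after the filename in the awk command.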
Preserving Original Order Of Strings
Pursuant to your comment regarding how to preserve the original order of the lines of text seen in your input file, you can keep a second array where the strings are stored in the order they are first seen, using a sequential index to keep them in order. For example, the o array (order array) is used below to store each unique string (fields 2-NF), and the variable n is used as a counter. A loop over the array checks whether the string is already contained, and if so, next is used to avoid storing the string again and to jump to the next record of input. In END, the loop then uses the for (i = 0; i < n; i++) form to output the information from both arrays in the order the strings were seen in the original file, e.g.
awk -v n=0 '{
    for (i=2; i<=NF; i++)
        str = str " " $i
    a[str] += $1
    for (i = 0; i < n; i++)
        if (o[i] == str) {
            str=""
            next;
        }
    o[n++] = str;
    str=""
}
END {
    for (i = 0; i < n; i++) print a[o[i]], o[i]
}' file
Output
37 cup of coffee
75 sign on the dotted
30 take a test
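The scan over o is repeated for every record, which can add up when there are many distinct strings. One possible alternative, offered here as a sketch rather than as a replacement for the approach above, is to ask awk whether the string is already an index of a before summing, using the in operator. The function name sum_keep_order and the sample input are made up for illustration.

```shell
# Hypothetical wrapper: same sums, first-seen order, membership test instead of a scan.
sum_keep_order() {
    awk '{
        str = ""
        for (i = 2; i <= NF; i++)
            str = str " " $i
        if (!(str in a))        # first time this text is seen
            o[n++] = str        # remember its position (n starts at 0)
        a[str] += $1            # must come after the membership test
    }
    END {
        for (i = 0; i < n; i++) print a[o[i]], o[i]
    }' "$@"
}

# Made-up sample input; "cup of coffee" appears first, so it is printed first
printf '7 cup of coffee\n10 take a test\n30 cup of coffee\n20 take a test\n' | sum_keep_order
```

The test has to run before a[str] += $1, because that assignment is what creates the element in a.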