
I have a file with thousands of lines, each containing a number followed by a line of text. I'd like to add up the numbers for the lines whose text is similar. I'd like unique lines to be output as well.

For example:

25 cup of coffee
75 sign on the dotted
28 take a test
2 take a test
12 cup of coffee

The output would be:

37 cup of coffee
75 sign on the dotted
30 take a test

Any suggestions on how this could be achieved in a unix shell?

I looked at "Shell command to sum integers, one per line?", but that question is about summing up a column of numbers across all lines in a file, not across similar text lines only.

Tony

6 Answers


There is no need for multiple processes and pipes; awk alone is more than capable of handling the entire job (and will be orders of magnitude faster on large files). With awk, simply append fields 2 through NF into one string and use that string as the index for summing the numbers in field 1 into an array. Then, in the END section, output the contents of the array. Presuming your data is stored in file, you could do:

awk '{
    for (i=2; i<=NF; i++)
        str = str " " $i
    a[str] += $1
    str=""
}
END {
    for (i in a) print a[i], i
}' file

Above, the first for loop appends all fields from 2 through NF to str, and a[str] += $1 sums the field-1 values into array a using str as the index. That ensures the values for similar lines are summed together. In the END section, you simply loop over each element of the array, outputting the element value (the sum) followed by the index (the original str built from fields 2-NF).
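
As a side note, the same key can be built without the field loop by stripping the first field from $0 with sub() (a minimal sketch; unlike the loop above, the key keeps the record's original spacing rather than being rebuilt with single spaces):

awk '{
    key = $0
    sub(/^[^ \t]+[ \t]+/, "", key)   # drop field 1 and the following separator
    a[key] += $1
}
END {
    for (key in a) print a[key], key
}' file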

Example Use/Output

Just take what is above, select it, and middle-mouse-paste it into a command line in the directory where your file is located (changing file to the name of your data file):

$ awk '{
>     for (i=2; i<=NF; i++)
>         str = str " " $i
>     a[str] += $1
>     str=""
> }
> END {
>     for (i in a) print a[i], i
> }' file
30  take a test
37  cup of coffee
75  sign on the dotted

If you want the lines sorted in a different order, just add | sort [options] after the filename to pipe the output to sort. For example, for output in the order you show, you would use | sort -k 2. The complete pipeline (the script above, unchanged, with sort appended) would be:
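
awk '{
    for (i=2; i<=NF; i++)
        str = str " " $i
    a[str] += $1
    str=""
}
END {
    for (i in a) print a[i], i
}' file | sort -k 2

and the output would be: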

37  cup of coffee
75  sign on the dotted
30  take a test

Preserving Original Order Of Strings

Pursuant to your comment about preserving the original order of the lines of text as seen in your input file, you can keep a second array where the strings are stored in the order they are seen, using a sequential index. In the version below, the o array (order array) stores each unique string (fields 2-NF) and the variable n serves as a counter. A loop over the array checks whether the string has already been stored; if so, next skips storing it again and jumps to the next input record. The END section then uses the for (i = 0; i < n; i++) form to output the information from both arrays in the order the strings were seen in the original file, e.g.

awk -v n=0 '{
    for (i=2; i<=NF; i++)
        str = str " " $i
    a[str] += $1
    for (i = 0; i < n; i++)
        if (o[i] == str) {
            str=""
            next;
        }
    o[n++] = str;
    str=""
}
END {
    for (i = 0; i < n; i++) print a[o[i]], o[i]
}' file

Output

37  cup of coffee
75  sign on the dotted
30  take a test
David C. Rankin
  • How do you maintain the order of appearance of the text (key) while populating / printing the array? – user0 Jul 04 '19 at 13:08
  • To preserve the original text (key) order you can use a separate array using sequential numbered indexes to store `str` in the order seen, like `order[n++] = str;` Then in `END` you can loop over the second array in-order (e.g. `for (i = 0; i < n; i++) print a[order[i]], order[i]`) to preserve the original text order. – David C. Rankin Jul 04 '19 at 15:04
  • I tested this solution on my data with thousands of lines and it worked really well, really fast. – Tony Jul 04 '19 at 23:53
  • Glad it helped. An `awk` solution alone will always be more efficient than a combination of shell utilities strung together in a script or with pipes. – David C. Rankin Jul 05 '19 at 02:16

Here is a simple awk script that does the task:

script.awk

{                          # for each input line
    inpText = substr($0, length($1)+2);  # read the input text after 1st field
    inpArr[inpText] = inpArr[inpText] + 0 + $1; # accumulate the 1st field in array
}
END {                     # post processing
    for (i in inpArr) {   # for each element in inpArr
        print inpArr[i], i; # print the sum and the key
    }
}

input.txt

25 cup of coffee
75 sign on the dotted
28 take a test
2 take a test
12 cup of coffee

running:

awk -f script.awk input.txt

output:

75 sign on the dotted
37 cup of coffee
30 take a test
Dudi Boy

Using datamash is relatively succinct. First use sed to change the first space to a tab (for this job datamash must have one and only one tab separator), then use -s -g2 to sort and group by the 2nd field (i.e. "cup of coffee" etc.), then use sum 1 to add up the first-column numbers within each group, and it's done. No, not quite: datamash prints the group key first and the sum second, so the number column has migrated to the 2nd field; datamash reverse migrates it back to the 1st field:

sed 's/ /\t/' file | datamash -s -g2 sum 1 | datamash reverse

Output:

37  cup of coffee
75  sign on the dotted
30  take a test
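
To see why the reverse step is needed, run just the first two stages: datamash prints the group key first and the sum second, so the intermediate output (with the sample data) looks like this:

$ sed 's/ /\t/' file | datamash -s -g2 sum 1
cup of coffee   37
sign on the dotted      75
take a test     30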
agc

You can do the following (assume the name of the file is file.txt):

sort -k2 -u file.txt | cut -d ' ' -f2- | while read -r key
do
    grep " $key$" file.txt | awk -v k="$key" '{s+=$1} END {print s "\t" k}'
done

Explanation:

1. Get all unique keys (cup of coffee, sign on the dotted, take a test); -f2- keeps everything from the 2nd field on, so multi-word keys stay intact:

sort -k2  -u file.txt   | cut -d ' ' -f2-

2. grep all lines with a given key from the file (the pattern is anchored with a leading space and a trailing $ so the key must end the line):

grep " $key$" file.txt

3. Sum the lines using awk, where s accumulates the field-1 numbers and k carries the key in:

awk -v k="$key" '{s+=$1} END {print s "\t" k}'

4. Put everything in a while read loop and iterate over the unique keys. (A for loop over $(...) would word-split the multi-word keys, so while read is used instead.)

Note: If a key can be a suffix of another key, for example "coffee" and "cup of coffee", even the anchored " $key$" pattern in step 2 will match both lines, so you will need a stricter regex, as sketched below.
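
A sketch of a stricter, fully anchored pattern (assuming the text never contains regex metacharacters):

grep "^[0-9][0-9]* $key$" file.txt

This only matches lines consisting of a number, a single space, and the exact key, so a key that is a suffix of a longer key (like "coffee" inside "cup of coffee") no longer collides.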

Erez Ben Harush

You mean something like this?

#!/bin/bash

# define a dictionary
declare -A dict

# loop over all lines
while read -r line; do

   # read first word as value and the rest as text
   IFS=' ' read value text <<< "$line"

   # use 'text' as key, get value for 'text', default 0
   [ ${dict[$text]+exists} ] && dictvalue="${dict[$text]}" || dictvalue=0

   # sum value
   value=$(( dictvalue + value ))

   # save new value in dictionary
   dict[$text]="$value" 
done < data.txt  

# loop over dictionary, print sum and text
for key in "${!dict[@]}"; do
   printf "%s %s\n" "${dict[$key]}" "$key"
done

output

37 cup of coffee
75 sign on the dotted
30 take a test
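
Side note: the existence check can be dropped entirely, because ${dict[$text]:-0} substitutes 0 for a key that has not been seen yet. A minimal sketch of the same read loop (dict is still declared with declare -A as above):

while read -r value text; do
   dict[$text]=$(( ${dict[$text]:-0} + value ))
done < data.txt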
UtLox

Another version based on the same logic as @David's answer above.
Changes: it omits the per-record loops to speed up the process.

awk '
{
  text = substr($0, index($0, $2))                 # text part: everything from field 2 on
  if (!(text in text_sums)) { texts[n++] = text }  # remember first-seen order
  text_sums[text] += $1                            # accumulate field 1 per text
}
END {
  for (i = 0; i < n; i++) print text_sums[texts[i]], texts[i]
}' input.txt

Explanation:
substr returns the string starting with field 2, i.e. the text part.
The array texts stores each text under a sequential integer index, if it is not already present in text_sums.
text_sums keeps adding field 1 for the corresponding text.

The reason for a separate array that stores each text value under a consecutive integer index is to guarantee the texts can be read back in the same consecutive order, i.e. the order of first appearance. Note that the END loop must use the numeric for (i = 0; i < n; i++) form; a plain for (i in texts) scan would not preserve that order.
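
With the sample input.txt from the question, this prints the sums in the order the texts first appear:

37 cup of coffee
75 sign on the dotted
30 take a test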

See Array Intro

The footnote there says:

The ordering will vary among awk implementations, which typically use hash tables to store array elements and values.
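
For completeness: GNU awk (4.0+) can force an ordered scan, so that even a for (i in texts) loop visits the numeric indexes in ascending order. A gawk-only sketch of the setting:

BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }   # gawk only: scan arrays in numeric index order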

user0