0

I am new to working with the Terminal on mac and have a large .tsv file that consists of a list of items, and two values next to it. I would like to be able to print the number of duplicates next to the first occurrence of the item Without removing additional data.

I am aware of cut -f 1 | sort | uniq -c but this removes a lot of valuable data I would like to keep for analysis. I'm reading about awk and grep but I think I could use a little help.

This is an example of the file I'm trying to process:

fruit   number  reference
apple   12  342
apple   13  345
apple   43  4772
banana  19  234
banana  73  3242
peach   131 53423
peach   234 3266
peach   242 324
peach   131 56758
peaches 29  2434

Ideally, the output would look something like this:

fruit   number  reference   fruit_count
apple   12  342 3
apple   13  345 
apple   43  4772    
banana  19  234 2
banana  73  3242    
peach   131 53423   4
peach   234 3266    
peach   242 324 
peach   131 56758   
peaches 29  2434    1

Is something like this even possible? I can get the desired output excel using formulas, but the file is too large and keeps crashing on me. Any help would be appreciated.

EDIT: Adding My current solution (that does not meet my requirements)

cut -f 1 fruitsample.txt | sort | uniq -c | sed -e 's/ *//' -e 's/ / /'

This gives me the intended counts, replacing the standard count + space output from uniq -c with a tab character, but it also sorts the header row and removes the 2nd and third columns.

On Excel, I can use the formula =IF(COUNTIF(A$2:A2,A2)=1,COUNTIF(A:A,A2),"") and fill it down. The file I'm working with is nearly 680K rows of data, and Excel chokes trying to calculate that many rows.

As I mentioned, I am a beginner looking for guidance. I'm just not as familiar with awk or grep. Thanks again!

mikelcal
  • 5
  • 4
  • 1
    I've voted to close this question because it appears to be a request for a recommendation for a tool or solution, rather than a request for assistance with your own code. This makes your question off-topic for StackOverflow. If that assessment was incorrect, and you do indeed want help writing your own code, then please [add your work so far to your question](http://stackoverflow.com/posts/43058786/edit) and I'll gladly retract my close vote. – ghoti Mar 28 '17 at 10:32
  • @ghoti Added the command I was using. – mikelcal Mar 28 '17 at 17:15
  • Close vote retracted. – ghoti Mar 28 '17 at 17:46

3 Answers3

2

awk to the rescue!

awk 'NR==FNR {a[$1]++; next} 
     FNR==1  {print $0, "fruit_count"; next} 
     $1 in a {$(NF+1)=a[$1]; delete a[$1]}1' file{,} | 
column -t

fruit    number  reference  fruit_count
apple    12      342        3
apple    13      345
apple    43      4772
banana   19      234        2
banana   73      3242
peach    131     53423      4
peach    234     3266
peach    242     324
peach    131     56758
peaches  29      2434       1

for explanation of the main idea I'll use a simpler structure without header, and unsorted data

$ cat file
apple
banana
apple
apple
cherry
banana

$ awk 'NR==FNR {a[$1]++; next}            # in the first pass, save key counts
                $1 in a                   # if the key in map
                        {$(NF+1)=a[$1];   # add the count as a last column
                         delete a[$1]}    # remove key from map
                1                         # print
       ' file{,} |                        # bash shorthand for: file file
  column -t                               # pretty print columns 


apple   3
banana  2
apple
apple
cherry  1
banana

for the simplified example, using unix toolchain you can achieve the same with

join -a1 -11 -22 -o1.2,2.1 <(cat -n file) <(cat -n file | sort -k2 | uniq -c -f1)

adding header will require more juggling; it's where awk shines.

karakfa
  • 66,216
  • 7
  • 41
  • 56
  • If you know of similar posts, could you please mark this as a dupe? And if not, please add an explanation to your answer. I'm having trouble understanding how you got from the code in the question to what you've included in your answer. :-P – ghoti Mar 28 '17 at 08:07
  • It's easier for me to write it up than searching the duplicates. For some reason SO doesn't let you search within your own answers (or I don't know how). – karakfa Mar 28 '17 at 14:30
  • 1
    You mean, like [this](http://stackoverflow.com/search?q=user%3A1435869+duplicate+count)? – ghoti Mar 28 '17 at 15:38
  • #SOreadytohelp! :-D I will relish this rare experience. Not often that I get to teach you anything! – ghoti Mar 28 '17 at 17:44
  • Am I doing something wrong? I just tried this, but it is not returning any output. :( I tried `$ awk 'NR=FNR {a[$1]++; next} FNR==1 {print $0, "fruit_count"; next} $1 in a {$(NF+1)=a[$1]}1' fruitysample.txt` and it just returned empty. – mikelcal Mar 28 '17 at 18:16
  • you omitted `{,}` after file name. If you don't like that shorthand repeat the file name two times. – karakfa Mar 28 '17 at 18:20
  • Tried both `fruitysample.txt{,}` and `fruitysample.txt fruitysample.txt` still not getting an output. Could it be the version of awk I'm using? `awk --version awk version 20070501 ` – mikelcal Mar 28 '17 at 18:35
  • not sure which `awk` is this? what OS are you on? there shouldn't be any `gawk` specific functions in the script. Did you transfer the file from windows? you may need to fix the line endings with `dos2unix` – karakfa Mar 28 '17 at 18:51
  • Installed gawk using brew, I'm now using GNU Awk 4.1.4. The file is a regular tab separated text file I made with textedit on MacOS 10.12.4 – mikelcal Mar 28 '17 at 19:11
  • copy/paste from the answer posted still works for me. It's weird if it's not printing anything since the `1` at the end should print the lines even if counts not added. – karakfa Mar 28 '17 at 19:15
  • Careful with the white space in the commented version. The comment + newline after `$1 in a` changes the behavior of the script. Just move the subsequent `{` to between the `$1 in a` and the `#`. – Ed Morton Mar 28 '17 at 20:43
0

Another using awk and double-tacs:

$ tac file | awk '
NR>1 {print q, (p==$1?"":++c)}                  # p previous first field, q previous record
     {c=(p==$1?c+1:0); p=$1; q=$0}              # c is the counter
END  {print q, "fruit_count"}
' | tac
fruit   number  reference fruit_count
apple   12  342 3
apple   13  345
apple   43  4772
banana  19  234 2
banana  73  3242
peach   131 53423 4
peach   234 3266
peach   242 324
peach   131 56758
peaches 29  2434 1
James Brown
  • 36,089
  • 7
  • 43
  • 59
0

This does what you want in a single pass of the input file and only stores the values for 1 fruit at a time in memory so it shouldn't have performance or memory issues despite your file being too big for MS-Excel:

$ cat tst.awk
NR==1 { print $0, "fruit_count"; next }
$1 != prev { prt() }
{ arr[++cnt] = $0; prev = $1 }
END { prt() }
function prt(    i) {
    if (cnt) {
        print arr[1], cnt
        for (i=2; i <= cnt; i++) {
            print arr[i]
        }
        delete arr
        cnt = 0
    }
}

$ awk -f tst.awk file | column -t
fruit    number  reference  fruit_count
apple    12      342        3
apple    13      345
apple    43      4772
banana   19      234        2
banana   73      3242
peach    131     53423      4
peach    234     3266
peach    242     324
peach    131     56758
peaches  29      2434       1
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • Thanks! this works, but does this require the file to be sorted first? I found in another post that it is possible to [sort within awk ignoring the header](http://stackoverflow.com/questions/14562423/is-there-a-way-to-ignore-header-lines-in-a-unix-sort/27368739#27368739) piping it through, like `NR<2{print $0;next}{print $0| "sort -r"}` I wrote the first line again as `NR<2 { print $0, "fruit_count"; next }{ print $0|"sort"}` but it just stacks the sorted data under the original output with the counts. – mikelcal Mar 28 '17 at 18:29
  • No necessarily sorted but grouped so that all lines with the same key ($1) are contiguous, just like the example you provided of how your data is arranged. If that's not what your data looks like then of course you should edit your question so your example better reflects your real data so we're not trying to help you solve a different problem from the one you actually have! – Ed Morton Mar 28 '17 at 20:30
  • I'd like to know more about what this program is doing. From my understanding, the first line prints the header, then the second part starts a loop to check whether the current item is the same as the previous one, if yes, it increases the count. Is that correct? I'm not entirely sure what the loop in the middle is doing `for (i=2; i <= cnt; i++)` why is the value of i set to 2 to start the loop? – mikelcal Mar 29 '17 at 05:51
  • 1
    Thanks for the recommendation @ed-morton! – mikelcal Mar 31 '17 at 18:31