I have 20 files. In each file I have a list of the occurring words and their frequency.

Example:

 2 représentant 
 3 reproduire 
 2 réseau 
 1 rester 
 3 reste 
 1 résumer 

I used this command to merge these 20 files:

cat *.txt > tous.txt | sort | uniq -ci  | sort -k3

The result was for example:

2  2 représentant 
1  6 représentant 
5  3 reproduire
2  3 reproduire  
6  3 réseau
1  1 réseau
etc..

But I want the total number of occurrences of each word, with each word listed only once:

8 représentant
6 reproduire
4 réseau
... 

I can do it with awk:

awk '{tab[$2]+=$1} END {for(i in tab){printf("%7d %s\n", tab[i], i) | "sort -k2"}}' ~/Bureau/Projet/data/dico/*.dico.forme.txt > ~/Bureau/Projet/data/input/black.txt

Are there any other suggestions, for example using `if`?

Cyrus
Dreem AT
  • 1
    You cannot accomplish this using a simple sort and uniq pipeline. There is no good way to use only those tools to grab the preexisting counts and sum them. You could do it with a more complex Bash script, but Awk is the simpler and likely best way to approach it. – David Hoelzer Jan 02 '18 at 00:57
  • 2
    Based on your description, shouldn't the total number of word occurrences be `10 représentant`, `21 reproduire`, `19 réseau`? That is, the word count listed on each line times the number of times that line occurs, _plus_ any other such counts for the same word? – bill.lee Jan 02 '18 at 01:07
  • é will not count as e – once Jan 02 '18 at 03:41
  • 1
    Please show us the exact command you used. The command in your question: `cat *.txt > tous.txt | sort | uniq -ci | sort -k3` will not do what you say it does. The output of `cat *.txt` is written to `tous.txt`, and will not be available as the input to `sort`. – Keith Thompson Jan 02 '18 at 04:13
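The redirection problem described in the last comment can be checked with a small sketch. The temporary directory, the file names `a.txt`, `tous.out` and `tous2.out`, and the use of `tee` are illustrative assumptions, not something the question used; the behaviour shown is that of Bash or a POSIX `sh`.

```shell
# Hypothetical sample file in the question's "count word" format.
dir=$(mktemp -d)
printf '%s\n' '2 hi' '3 try' > "$dir/a.txt"

# With "> file |", stdout goes to the file, so the pipe carries no data.
piped=$(cat "$dir/a.txt" > "$dir/tous.out" | wc -l | tr -d ' ')

# tee writes the file and also passes the data on down the pipe.
teed=$(cat "$dir/a.txt" | tee "$dir/tous2.out" | wc -l | tr -d ' ')

echo "lines seen after '>': $piped, after tee: $teed"
rm -rf "$dir"
```

If the merged file is not needed at all, the redirection can simply be dropped so that the data flows straight into `sort`.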

2 Answers

The simplest way is not to do the counting in the first place. There seems to be no easy way to sum existing counts with uniq, but you can do it using Awk or a shell loop.

  1. Combine all the data (assume space-separated)

    cat *.txt >all.txt
    
    cat all.txt  
    2 hi  
    2 test  
    3 try  
    3 hi  
    5 test  
    3 try
    
  2. Count again

    With Awk:

    sort -k2,2 all.txt | awk '{a[$2] += $1} END{for (i in a) print a[i],i}'
    

    Output:

      5 hi  
      7 test  
      6 try
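Awk does not guarantee any particular order for `for (i in a)`, so sorting the input does not by itself order the output. A minimal sketch, assuming the same sample data, that sorts after the summing step instead (the temporary file and the `LC_ALL=C` setting are assumptions made here for reproducibility):

```shell
# Hypothetical sample data, same values as all.txt above.
tmp=$(mktemp)
printf '%s\n' '2 hi' '2 test' '3 try' '3 hi' '5 test' '3 try' > "$tmp"

# Sum the first field per word, then sort on the word column.
# LC_ALL=C keeps the ordering byte-wise and reproducible.
totals=$(awk '{ sum[$2] += $1 } END { for (w in sum) print sum[w], w }' "$tmp" |
    LC_ALL=C sort -k2,2)

printf '%s\n' "$totals"
rm -f "$tmp"
```

With accented words such as those in the question, you would probably want to drop `LC_ALL=C` and rely on your own locale's collation instead.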
    

... Or you can do it with a while loop (less efficient):

while read -r a; do
    echo "$(grep -w "$a" all.txt|cut -d ' ' -f1|paste -sd+|bc)" "$a"
done< <(cut -d ' ' -f2 all.txt|sort -u)

Or reverse what `uniq -c` did by repeating each word as many times as its count, then counting again:

while read -r a b; do
    yes "$b" |head -n "$a"
done <all.txt | sort| uniq -c
tripleee
once
  • I agree with "don't do the counting in the first place" but the rest is just horrible. – tripleee Jan 02 '18 at 07:22
  • I'm afraid not. You [read lines with `for`](http://mywiki.wooledge.org/DontReadLinesWithFor) and [don't quote your variables](https://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-shell-variable) and you still unnecessarily use a temporary file and fail to clean it up afterwards; but more generally, the OP's Awk script is already doing a much better job of solving the problem efficiently. – tripleee Jan 02 '18 at 08:58
  • 1
    Your Awk refactoring is a bit of an improvement but it loses the sort order. In any event, I'd be hesitant to remove my downvote as long as your shell script doesn't even pass http://shellcheck.net/ – tripleee Jan 02 '18 at 09:00
  • @tripleee all "No issues detected!" using #!/bin/bash now, sorry i dont know shellcheck – once Jan 02 '18 at 09:21
  • 1
    Reading the entire file on every iteration just to `grep` it is very inefficient. The use of `echo` with backticks also sets off some odd olfactory sensations, though it's not technically a [useless use of `echo`](http://www.iki.fi/era/unix/award.html#echo) – tripleee Jan 02 '18 at 09:30
  • 1
    I reluctantly removed my downvote for all the effort you put in, but really, the TLDR of this answer is, "I have no *better* solution than the OP's Awk script". – tripleee Jan 02 '18 at 10:16
  • The `grep` will be wrong if one word is a substring of another. The simplest workaround is to not try to fix that code because of the other issues with it. – tripleee Jan 02 '18 at 10:23

There is no need to store intermediate results in tous.txt and no need really to keep the entire array in memory, though this is a minor efficiency hack which won't make much difference unless your data set is large.

sort -k2,2 *.txt |
awk 'NR>1 && $2 != prev { print sum, prev; sum = 0 }
    { prev = $2; sum += $1 }
    END { print sum, prev }'

Notice how the END block repeats (part of) the main flow. (Missing the last output line is a common bug with this general approach.)
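To see that the last group really is emitted by the `END` block, here is a minimal sketch on made-up sample data. The temporary file, the `LC_ALL=C` setting, and the `if (NR)` guard (which avoids printing an empty line when there is no input at all) are additions for this illustration.

```shell
# Hypothetical sample data in "count word" format.
tmp=$(mktemp)
printf '%s\n' '2 hi' '2 test' '3 try' '3 hi' '5 test' '3 try' > "$tmp"

# Sort by word so equal words are adjacent, then keep only one running sum.
streamed=$(LC_ALL=C sort -k2,2 "$tmp" |
    awk 'NR > 1 && $2 != prev { print sum, prev; sum = 0 }  # word changed: flush
         { prev = $2; sum += $1 }                            # accumulate current word
         END { if (NR) print sum, prev }')                   # flush the final word

printf '%s\n' "$streamed"
rm -f "$tmp"
```

Without the `END` block, the line for `try` (the last word in sort order) would be missing from the output.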

As already suggested by others, if you can avoid the *.txt files and go straight to a sort | uniq -c with the entire raw input, that might end up being more elegant and efficient.
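A rough sketch of that idea, assuming the raw input is plain text with words separated by whitespace. The sample file and the trailing `awk` step, which only normalizes the padding that `uniq -c` adds, are assumptions for illustration.

```shell
# Hypothetical raw text, before any per-file counting has been done.
tmp=$(mktemp)
printf '%s\n' 'hi test try' 'test hi test' > "$tmp"

# One word per line, sort so duplicates are adjacent, then count them.
counts=$(tr -s '[:space:]' '\n' < "$tmp" |
    LC_ALL=C sort |
    uniq -c |
    awk '{ print $1, $2 }')

printf '%s\n' "$counts"
rm -f "$tmp"
```

Real text would likely need extra steps, such as stripping punctuation or folding case, before the counts are meaningful.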

tripleee