I have a shell script that checks every file in a folder for the word "Author", counts how many times it appears in each file, and prints one line per file. Each count is prefixed with "hotel_$i", where i is 1 on the first line and increases by one on each following line. Here is my script:

#!/bin/bash
paste <(printf 'hotel_%d\n' {1..825}) \
<(find . -type f -exec bash -c 'grep -wo "Author" {} | wc -l' \; | sort -nr)

The problem is that I have 828 output lines (suggesting there are 828 files in my folder) when there are only 825 files in the folder. Here is my output:

hotel_1   2686
...(hotel_2 - hotel_824 output lines)
hotel_825  13
        1
        1
        0

I assume that the two 1's and the 0 come from "extra" files (perhaps not). Why do they appear, and how do I get rid of them? How can there be more files in my folder than there appear to be?

John Smith
  • I do not know much about bash or the Unix file system but could it be your complicated command ends up in a temporary file at some point (because of all the piping) and the 1's are for the number of times "Author" occurs in your command? The zero may be for the directory entry itself (if that is a thing in Unix). – Martin Maat Feb 16 '16 at 17:02
  • Are all the files in a single folder or in a group of folders? If in a single folder, just use a glob `hotel_*` rather than find... – dawg Feb 16 '16 at 17:03
  • What does `find . -type f -print | wc -l` tell you - 825 or 828 or something else? Does `ls | wc -l` say the same? How about `find . -type d -print`? How about `diff <(ls | sort -u) <(find . -printf "%f\n" | sort -u)` and so on to debug.... – Ed Morton Feb 16 '16 at 17:08
  • I see you're using the answer you got to this question: http://stackoverflow.com/q/35420891/1745001. You asked that question and accepted the first answer you got. You might want to hold off a bit on accepting answers to see if other people have different suggestions - the first answer you get MIGHT not always be the best one. – Ed Morton Feb 16 '16 at 17:19
  • @EdMorton If you look at OP's question history, you'll notice that they're asking many closely related questions, always building on the last answer they got, leading to some probably heavily suboptimal overall "design". – Benjamin W. Feb 16 '16 at 18:56
  • The output of the `find -exec` command is not one per input file, but one per time the string `Author` is found in any of these files. – Benjamin W. Feb 16 '16 at 18:58
  • When I read http://stackoverflow.com/q/35420891/1745001 it looks like the hotel_xxx are strings not related to the filenames. Can you update your question to say that you want to generate a string independent of the filenames? – Walter A Feb 16 '16 at 19:31
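
The checks suggested in the comments can be tried on a scratch folder first. The following is a minimal sketch, not a diagnosis of the asker's folder: the directory and file names (`hotel_1`, `.hidden`, `sub/nested`) are made up for the demo. It shows one way the numbers could disagree: `find . -type f` also counts hidden files and files in subfolders, which a plain `ls` does not list. `-maxdepth` is a GNU/BSD find extension, not POSIX.

```shell
#!/bin/bash
# Hypothetical demo folder: two visible files, one hidden file, one file in a subfolder.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'Author a\n' > hotel_1
printf 'Author b Author c\n' > hotel_2
printf 'Author\n' > .hidden
mkdir sub && printf 'none\n' > sub/nested

# Visible top-level entries (this also counts the directory "sub").
ls_count=$(ls | wc -l)
# Every regular file below ".", hidden and nested ones included.
find_count=$(find . -type f | wc -l)
# Only visible regular files directly in the folder.
top_count=$(find . -maxdepth 1 -type f ! -name '.*' | wc -l)

echo "ls: $ls_count  find: $find_count  top-level visible: $top_count"
```

If the first two numbers differ in the real folder, listing the output of `find . -type f` and comparing it with `ls` should show which entries are unexpected. A script saved inside the folder that itself contains the word "Author" would also be picked up by `find`.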

2 Answers

Just use awk, e.g. with GNU awk for ENDFILE:

awk '/Author/{c++} ENDFILE{print "hotel_"ARGIND, c+0; c=0}' *

or if your files are actually named "hotel_*":

awk '/Author/{c++} ENDFILE{print FILENAME, c+0; c=0}' hotel_*

If that doesn't do what you want, then edit your question to show some concise, testable sample input and expected output so we can help you solve your problem the right way. Your current approach is the wrong way to go about it.

Ed Morton
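
A note on the awk answer above: `/Author/{c++}` counts matching lines, while the `grep -wo ... | wc -l` in the question counts occurrences, so the two can differ when a line contains "Author" more than once. Below is a minimal sketch of counting occurrences per file in portable awk (no `ENDFILE`), run on two made-up sample files. Two assumptions to be aware of: `gsub` here matches the substring "Author" (so "Authors" would count too, unlike `grep -w`), and a completely empty file produces no output line because awk never reads a record from it.

```shell
#!/bin/bash
# Hypothetical sample files for the demo.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'Author and Author\nnothing here\n' > hotel_1
printf 'one Author\n' > hotel_2

counts=$(awk '
  # First line of a new file (but not the very first file): print the previous file total.
  FNR == 1 && NR > 1 { print prev, c + 0; c = 0 }
  # gsub returns the number of replacements made, i.e. the occurrences on this line.
  { prev = FILENAME; c += gsub(/Author/, "&") }
  # Print the total for the last file.
  END { if (NR > 0) print prev, c + 0 }
' hotel_*)

echo "$counts"
```

With GNU awk available, the `ENDFILE` form from the answer is shorter and also handles empty files; only the counting action would need to change to `c += gsub(/Author/, "&")`.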

Just try

$ for e in hotel_{1..825}; do echo "$e"; grep -wo "Author" "$e" | wc -l; done

Not tested...


If you want to sort those by the number of matches you can do:

$ for e in hotel_{1..825}; do printf '%s %d\n' "$e" "$(grep -wo "Author" "$e" | wc -l)"; done | sort -nr -k 2

dawg
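
If, as one comment suggests, the `hotel_N` labels are just rank numbers and not file names, the original output could be produced without hard-coding 825. This is a sketch under that assumption, using made-up sample files: the loop only visits visible regular files directly in the folder (a bash `*` glob skips dotfiles by default, and `[ -f ... ]` skips subfolders), and awk numbers the sorted counts.

```shell
#!/bin/bash
# Hypothetical demo folder; .hidden and sub/d.txt should not be counted.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'Author\n' > a.txt
printf 'Author Author Author\n' > b.txt
printf 'no match\n' > c.txt
printf 'Author\n' > .hidden
mkdir sub && printf 'Author\n' > sub/d.txt

ranked=$(
  for f in *; do
    [ -f "$f" ] || continue              # skip directories and other non-regular files
    grep -wo "Author" "$f" | wc -l       # whole-word occurrences in this file
  done | sort -nr | awk '{ print "hotel_" NR, $1 + 0 }'   # label by rank; "+ 0" drops wc padding
)

echo "$ranked"
```

Because the labels come from `NR`, the number of output lines always equals the number of files actually scanned, so a mismatch like 828 versus 825 cannot appear as unlabeled trailing lines. If the script itself is saved in the same folder, it will be scanned as well.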