
I'd like to compute word statistics for all the .txt files in the current directory and its subdirectories.

In [39]: ls
about.txt            distutils/           installing/          whatsnew/
bugs.txt             extending/           library/             word.txt
c-api/               faq/                 license.txt          words_frequency.txt
contents.txt         glossary.txt         reference/
copyright.txt        howto/               tutorial/
distributing/        install/             using/

First I tried this command:

 In [46]: !grep -Eoh '[a-zA-Z]+' *.txt | nl

The problem is that files in the subdirectories are not found:

 In [45]: !echo *.txt
 about.txt bugs.txt contents.txt copyright.txt glossary.txt license.txt word.txt words_frequency.txt

I tried to improve it:

In [48]: ! echo */*.txt | grep "about.txt"
In [49]:

Again a problem: */*.txt matches exactly one directory level, so it misses the files at the top level (such as about.txt) and cannot descend to arbitrary depth.

Interestingly, Python has a solution to this problem:

In [50]: files = glob.glob("**/*.txt", recursive=True)
In [54]: files.index('about.txt')
Out[54]: 4

It can traverse directories recursively and find all the .txt files.
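
For completeness, a rough sketch of finishing the whole count in Python (the Counter-based counting below is my own addition, not part of the original attempt; it reuses the same [a-zA-Z]+ pattern as the grep call above):

import glob
import re
from collections import Counter

counter = Counter()
for path in glob.glob("**/*.txt", recursive=True):
    with open(path, encoding="utf-8", errors="ignore") as f:
        # extract runs of ASCII letters, mirroring the grep pattern
        counter.update(re.findall(r"[a-zA-Z]+", f.read()))

# show the 20 most frequent words with their counts
for word, count in counter.most_common(20):
    print(count, word)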

However, Python is more cumbersome than grep "pattern" *.txt when it comes to moving around files and processing their text.

How can I make the shell wildcards expand recursively like this?

As an alternative, the find command helps:

find . -regex '.*\.txt' -exec grep -Eoh '[a-zA-Z]+' {} \; | nl

That is still not as handy as a recursive wildcard would be.

globstar could not be activated on macOS:

$ shopt -s globstar
-bash: shopt: globstar: invalid shell option name
$ bash --version
GNU bash, version 4.4.19(1)-release (x86_64-apple-darwin17.3.0)
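
(globstar only exists since bash 4.0, so one possible explanation is that the shopt above ran in the stock macOS /bin/bash 3.2 rather than in the 4.4 build reported by bash --version. With a bash >= 4 actually running, the recursive-wildcard version would look roughly like this untested sketch:)

shopt -s globstar                      # let ** match directories recursively (bash >= 4.0)
grep -Eoh '[a-zA-Z]+' **/*.txt | nl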
AbstProcDo

1 Answer

If I understood the question correctly, you may use something like this:

find . -type f -name '*.txt' -exec /bin/grep -hEo '\w+' {} \; \
  | sort \
  | uniq -c \
  | sort -k1,1n
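
If only the most frequent words are of interest, the final sort can be reversed and trimmed, for example (a small variation on the above, not part of the original answer):

find . -type f -name '*.txt' -exec /bin/grep -hEo '\w+' {} \; \
  | sort \
  | uniq -c \
  | sort -k1,1nr \
  | head -n 20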
hek2mgl