If I understand correctly and you want a list of items sorted by frequency, you can pipe the output through something like:
| sort | uniq -c | sort -k1nr
E.g.:
Input:
file1
file2
file1
file1
file3
file2
file2
file1
file4
Output:
4 file1
3 file2
1 file3
1 file4
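For comparison, here is a minimal Python sketch of the same counting step, using collections.Counter from the standard library. The function name frequency_table is just an illustrative choice, and ties are broken alphabetically, which the shell pipeline does not guarantee.

```python
from collections import Counter


def frequency_table(items):
    """Return (count, item) pairs, most frequent first, ties broken by name."""
    counts = Counter(items)
    return sorted(((n, item) for item, n in counts.items()),
                  key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    sample = ["file1", "file2", "file1", "file1", "file3",
              "file2", "file2", "file1", "file4"]
    for n, item in frequency_table(sample):
        print("%d %s" % (n, item))
```

Run on the sample input above, this prints the same four lines as the shell pipeline.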
Update
By the way, what are you using awk for?
find . -name 'quest*' | cut -d_ -f1 | sort | uniq -c | sort -k1nr | head -n10
Returns the 10 items found most often.
Update
Here is a much improved version (it needs GNU sed for the \+ and gawk 4 or later for the nested arrays). The only drawback is that it does not sort by number of occurrences. However, I'm going to figure out how to fix that :)
find . -name 'question*' | sort \
| sed "s#\(.*/question\([0-9]\+\)_[0-9]\+\)#\2 \1#" \
| awk '{ cnt[$1]++; files[$1][NR] = $2 } END{for(i in files){ print i" ("cnt[i]")"; for (j in files[i]) { print " "files[i][j] } }}'
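To see what the sed step feeds into awk, here is a small Python sketch of the same substitution: it puts the question id in front of each path so that awk can use it as the grouping key. The helper name prefix_with_id and the sample path are made up for illustration.

```python
import re

# Same pattern as the sed expression, written in Python regex syntax:
# group 1 is the path up to and including questionID_N, group 2 is the ID.
PATTERN = re.compile(r"(.*/question([0-9]+)_[0-9]+)")


def prefix_with_id(path):
    """Rewrite 'dir/questionID_N' as 'ID dir/questionID_N'."""
    # count=1 mirrors sed's default of replacing only the first match.
    return PATTERN.sub(r"\2 \1", path, count=1)


if __name__ == "__main__":
    print(prefix_with_id("./a/question12_3"))
```

Lines that do not match the pattern pass through unchanged, just as they do with sed.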
Update
After testing on ~1.4M records (it took 23 seconds), I decided that awk was too inefficient for all the grouping work, so I rewrote it in Python:
#!/usr/bin/env python
import re
import sys

file_re = re.compile(r"(?P<name>.*/question(?P<id>[0-9]+)_[0-9]+)")


def runstuff(infiles):
    counts = {}  # id -> number of files with that id
    files = {}   # id -> list of file names
    for infile in infiles:
        m = file_re.match(infile.strip())
        if m is None:
            continue  # skip lines that do not look like question files
        _name = m.group('name')
        _id = m.group('id')
        counts[_id] = counts.get(_id, 0) + 1
        files.setdefault(_id, []).append(_name)

    ## Calculate groups: count -> list of ids having that count
    grouped = {}
    for k in counts:
        grouped.setdefault(counts[k], []).append(k)

    ## Print results, ordered by ascending count
    for k, v in sorted(grouped.items()):
        for fg in v:
            print("%s (%s)" % (fg, counts[fg]))
            for f in sorted(files[fg]):
                print(" %s" % f)


if __name__ == '__main__':
    runstuff(sys.stdin)
This one does all the work of splitting, grouping and sorting.
And it took only about 3 seconds to run on the same input file (sorting included).
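If you would rather see the most frequent ids first, as with sort -k1nr in the shell pipeline, the grouping could be sketched like this. It is a variant under my own assumptions, not a drop-in replacement: the function names group_by_id and most_frequent_first are hypothetical, and non-matching lines are silently skipped.

```python
import re
from collections import defaultdict

FILE_RE = re.compile(r"(?P<name>.*/question(?P<id>[0-9]+)_[0-9]+)")


def group_by_id(paths):
    """Map each question id to the sorted list of matching file names."""
    files = defaultdict(list)
    for path in paths:
        m = FILE_RE.match(path.strip())
        if m:  # assumption: lines that do not match are ignored
            files[m.group("id")].append(m.group("name"))
    return {qid: sorted(names) for qid, names in files.items()}


def most_frequent_first(groups):
    """Return (id, names) pairs by descending file count, then by id."""
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))


if __name__ == "__main__":
    sample = ["./a/question2_1", "./a/question1_1", "./b/question2_2"]
    for qid, names in most_frequent_first(group_by_id(sample)):
        print("%s (%d)" % (qid, len(names)))
        for name in names:
            print(" %s" % name)
```

The count is simply the length of each group, so a separate counts dictionary is not needed here.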
If you need even more speed, you could try compiling it with Cython, which is usually at least 30% faster.
Update - Cython
Ok, I just tried with Cython.
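Just save the above file as calculate2.pyx, making sure its processing code lives in a function named runstuff that takes the input stream, since the launcher below calls calculate2.runstuff(sys.stdin). In the same folder, create setup.py: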
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("calculate2", ["calculate2.pyx"])]
)
And a launcher script (I named it calculate2_run.py):
import calculate2
import sys

if __name__ == '__main__':
    calculate2.runstuff(sys.stdin)
Then, make sure you have Cython installed, and run:
python setup.py build_ext --inplace
That should generate, among other things, a calculate2.so file.
Now, use calculate2_run.py as you normally would (just pipe in the results from find).
I ran it, without any further optimization, on the same input file: this time, it took 1.99 seconds.