don't sort for no reason:
nawk '_[$-__]--'    # "$-__" is just $0 : prints every occurrence after the 1st
gawk '__[$_]++'     # same filter via ++ ("$_" is also just $0)
mawk '__[$_]++'
Mary
Mary
Mary
John
John
Lucy
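for instance, with a hypothetical input like this (names.txt and its contents are my own guess at something that would reproduce the output above, not the original data):

printf '%s\n' Mary Mary Mary Mary John John John Lucy Lucy > names.txt

mawk '__[$_]++' names.txt   # emits every occurrence after the first,
                            # in input order : 3x Mary, 2x John, 1x Lucy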
for 1 GB+ files, you can speed things up a bit by preventing FS from splitting unnecessary fields:

mawk2 '__[$_]++' FS='\n'    # with FS='\n' each record stays a single field,
                            # so no time is wasted on field splitting
for 100 GB inputs, one idea would be to use parallel to create, say, 10 instances of awk, piping the full 100 GB to each instance but assigning each one a particular range to partition on its end (e.g. instance 4 handles lines beginning with F-Q, etc). instead of outputting it all and THEN attempting to sort the monstrosity, each instance would simply tally up and print out a frequency report of how many copies ("Nx") of each unique line ("Lx") it has recorded. from there one could sort a much smaller file on the column holding the Lx's, THEN pipe it to one more awk that prints out Nx copies of each line Lx.
probably a lot faster than trying to sort 100 GB
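a rough sketch of that idea, with plain background jobs standing in for parallel; the 2-way split, the hard-coded "M" boundary, the tab delimiter, and the file names big.txt / part*.tally are all illustrative only, and it assumes the lines themselves contain no tabs:

# step 1 : each instance scans the whole file but only tallies its own
#          first-character range, emitting "count <TAB> line" per unique line
mawk 'toupper(substr($0,1,1)) <= "M" { n[$0]++ }
      END { for (L in n) print n[L] "\t" L }' big.txt > part1.tally &
mawk 'toupper(substr($0,1,1)) >  "M" { n[$0]++ }
      END { for (L in n) print n[L] "\t" L }' big.txt > part2.tally &
wait

# step 2 : sort the much smaller frequency reports on the Lx column
sort -t "$(printf '\t')" -k2 part1.tally part2.tally > merged.tally

# step 3 : re-expand : print Nx copies of each unique line Lx
mawk -F'\t' '{ for (i = 0; i < $1; i++) print $2 }' merged.tally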
I created a test scenario by cloning 71 shuffled copies of a raw file with these stats :
uniq rows = 8125950. | UTF8 chars = 160950688. | bytes = 160950688.
-- 8.12 mn unique rows spanning 154 MB, resulting in a 10.6 GB test file:
in0: 10.6GiB 0:00:30 [ 354MiB/s] [ 354MiB/s] [============>] 100%
rows = 576942450. | UTF8 chars = 11427498848. | bytes = 11427498848.
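something along these lines would produce such a file (raw.txt, shuf, and the exact pipeline here are placeholders, not the commands actually used):

# clone 71 independently shuffled copies of the raw file into one test file
for i in $(seq 71); do shuf raw.txt; done > testfile.txt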
even when using just a single instance of awk, it finished filtering the 10.6 GB in ~13.25 mins - reasonable given that it's tracking 8.1 mn unique hash keys:
in0: 10.6GiB 0:13:12 [13.7MiB/s] [13.7MiB/s] [============>] 100%
out9: 10.5GiB 0:13:12 [13.6MiB/s] [13.6MiB/s] [<=> ]
( pvE 0.1 in0 < testfile.txt | mawk2 '__[$_]++' FS='\n' )
783.31s user 15.51s system 100% cpu 13:12.78 total
5e5f8bbee08c088c0c4a78384b3dd328 stdin