hmm…… what speeds do you need ? i created a synthetic 77.6 GB
file with nearly 525 mn
rows with plenty of unicode :
rows = 524759550. | UTF8 chars = 54008311367. | bytes = 83332269969.
and randomly selected rows at an avg. rate of 1 every 3^5
, using rand()
not just NR % 243
, to place the string db_pd.Clients
at a random position in the middle of the existing text, totaling 2.16 mn rows
where the regex pattern hits
rows = 2160088. | UTF8 chars = 42286394. | bytes = 42286394.
% dtp; pvE0 < testfile_gigantic_001.txt|
mawk2 '
_^(_<_)<NF { print (__=NR-(_+=(_^=_<_)+(++_)))<!_\
?_~_:__,++__+_+_ }' FS='db_pd[.]Clients' OFS=','
in0: 77.6GiB 0:00:59 [1.31GiB/s] [1.31GiB/s] [===>] 100%
out9: 40.3MiB 0:00:59 [ 699KiB/s] [ 699KiB/s] [ <=> ]
524755459,524755470
524756132,524756143
524756326,524756337
524756548,524756559
524756782,524756793
524756998,524757009
524757361,524757372
And mawk2
took just 59 seconds
to extract out a list of row ranges it needs. From there it should be relatively trivial. Some overlapping may exist.
At throughput rates of 1.3GiB/s
, as seen above calculated by pv
, it might even be detrimental to use utils like parallel
to split the tasks.