In a comment on this question about selecting n random lines from a text file:
Select random lines from a file
A user commented that they had used the shuf command to randomly select lines from a text file with 78 billion lines in less than a minute.
From various sources on the internet, I see that text files in the 100 GB to 300 GB range hold a mere 7-15 billion lines, depending on the line contents and metadata.
Assume we have:
A text file containing only ASCII characters, with a newline after every 100 characters, giving 78 billion lines in total.
A system with the following specifications:
a. RAM - 16GB
b. Processor - 2.5 GHz Intel Core i7
c. Disk - 500GB SSD
I am curious about the following:
- What will be the estimated size of the text file? Will that also depend on how different OSs encode and store characters on disk (character encoding, line terminators)? If so, how much will that factor into the size calculation? (A back-of-envelope calculation is sketched after this list.)
- Roughly how long will shuf, run from bash on the system described above, take to process this file?
- If the file runs to several terabytes, how should the data be fed to the system, and how can shuf operate on such a large file with maximum efficiency on this system? (One commonly suggested streaming approach is sketched after this list.)
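For the size question, here is a rough back-of-envelope sketch of the arithmetic, assuming plain ASCII at 1 byte per character plus a 1-byte LF line terminator (these per-line byte counts are assumptions, not measurements):

    # Rough size estimate: 100 ASCII bytes + 1 newline (LF) byte per line.
    # The 1-byte-per-character and LF-only assumptions are stated above.
    lines=78000000000                  # 78 billion lines
    echo $(( lines * (100 + 1) ))      # 7878000000000 bytes, i.e. ~7.9 TB (~7.2 TiB)

Under these assumptions the file comes to roughly 7.9 TB, which already dwarfs the 500 GB SSD above; with CRLF line endings (2 bytes per line break) the estimate grows to about 8.0 TB.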
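For the multi-terabyte case, one commonly suggested streaming approach is reservoir sampling: selecting n random lines in a single pass with memory proportional to n, not to the file size. A minimal awk sketch is below; the file name bigfile.txt and the sample size n are placeholders, and this illustrates the technique rather than how shuf is implemented internally.

    # Uniformly sample n lines in one streaming pass (reservoir sampling, Algorithm R).
    # Memory use is O(n) regardless of input size; bigfile.txt is a placeholder name.
    n=10
    awk -v n="$n" '
        BEGIN { srand() }
        NR <= n { r[NR] = $0; next }     # fill the reservoir with the first n lines
        {
            i = int(rand() * NR) + 1     # random index in 1..NR
            if (i <= n) r[i] = $0        # keep this line with probability n/NR
        }
        END { for (k = 1; k <= n; k++) print r[k] }
    ' bigfile.txt

GNU shuf also has a -n COUNT option to output only COUNT lines, which would be the natural command to benchmark against for this use case.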
shuf performance on my system (specs above):
1 line of 100 characters (file size ~100 bytes):
    real 0m0.025s   user 0m0.007s   sys 0m0.013s
100,000 lines of 100 characters (file size ~10 MB):
    real 0m0.122s   user 0m0.036s   sys 0m0.080s
100,000,000 lines of 100 characters (file size ~10 GB):
    real 9m37.108s  user 2m22.011s  sys 3m3.659s
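For anyone who wants to reproduce comparable numbers, here is one way such a test file could be generated and timed; this is a sketch with arbitrary choices (the filler character 'a', the file name test.txt), not necessarily the exact commands behind the figures above:

    # Generate N lines of 100 identical ASCII characters, then time a full shuffle.
    # N, the filler character, and the output file name are arbitrary choices.
    N=100000000                                        # 100 million lines ≈ 10 GB
    yes "$(printf 'a%.0s' {1..100})" | head -n "$N" > test.txt
    time shuf test.txt > /dev/null

Redirecting shuf's output to /dev/null isolates the shuffle itself; writing the shuffled result back to disk would roughly double the I/O.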
[Aside: for those who are interested, here are some Reddit meme threads about this:
https://www.reddit.com/r/ProgrammerHumor/comments/grsreg/shuf_go_brrr/
https://www.reddit.com/r/ProgrammerHumor/comments/groe6y/78_billion_prayers/ ]
Edit #1: Refined the question with more detail and added more resources, as per comments and findings.
Edit #2: Added shuf performance on my system for different text file sizes.