
From a comment on a question about selecting n random lines from a text file here:

Select random lines from a file

A user commented that they used the `shuf` command to randomly select lines from a text file with 78 billion lines in less than a minute.

Various sources on the internet report text files in the 100 GB to 300 GB range for a mere 7-15 billion lines, depending on the content and metadata.

Assuming we have:

  1. A text file containing ASCII characters, with a newline after every 100 characters. The file has 78 billion lines.

  2. A system with the following compute capacity:

    a. RAM - 16GB

    b. Processor - 2.5 GHz Intel Core i7

    c. Disk - 500GB SSD

I am curious as to :

  1. What will the estimated size of the text file be?

     Will the size also depend on how different OSs encode and store characters on disk? If yes, how much will that factor into the size calculation? (A back-of-the-envelope sketch follows this list.)

  2. Ideally, how much time will `shuf`, invoked from bash on the system with the above specifications, take to process this text file?

  3. If the text file comes in at several TB, how should the data be fed to the system, and how can `shuf` operate on such large files with maximum efficiency on the mentioned system?
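
For question 1, a back-of-the-envelope estimate, assuming plain ASCII (1 byte per character) plus a 1-byte `\n` terminator per line (a CRLF-terminated file, as produced on Windows, would add one more byte per line):

    # rough size estimate: 78 billion lines x (100 chars + 1-byte newline)
    lines=78000000000
    bytes_per_line=$((100 + 1))
    echo "$((lines * bytes_per_line)) bytes"
    # -> 7878000000000 bytes, i.e. about 7.9 TB, far larger than the 500GB SSD above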

shuf performance on my system (specs above):

  • 100 characters per line : 1 line

    FILE SIZE : ~100 Bytes

    TIME :
    real 0m0.025s
    user 0m0.007s
    sys  0m0.013s

  • 100 characters per line : 100,000 lines

    FILE SIZE : ~10 MB

    TIME :
    real 0m0.122s
    user 0m0.036s
    sys  0m0.080s

  • 100 characters per line : 100,000,000 lines

    FILE SIZE : ~10 GB

    TIME :
    real 9m37.108s
    user 2m22.011s
    sys  3m3.659s
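
One way to generate a comparable test file and time `shuf` on it (the file name and line count are placeholders, not necessarily the exact commands used for the numbers above):

    N=100000      # number of 100-character lines to generate
    # random printable ASCII, wrapped at 100 characters per line
    tr -dc 'A-Za-z0-9 ' </dev/urandom | fold -w 100 | head -n "$N" > test.txt

    time shuf test.txt > /dev/null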

[Aside: for those who are interested, here are some Reddit meme threads on the same topic: https://www.reddit.com/r/ProgrammerHumor/comments/grsreg/shuf_go_brrr/ and https://www.reddit.com/r/ProgrammerHumor/comments/groe6y/78_billion_prayers/ ]

Edit #1: Refined the question with more detail and added more resources, as per the comments and findings.

Edit #2: Added shuf performance on my system for different text file sizes.

ampathak
  • It doesn't have anything to do with bash. When you do `sort -R file`, for example, it's `sort` that processes the file; bash doesn't even see the file content. – oguz ismail May 28 '20 at 03:28
  • `1.` mixes concepts. "Raw data" implies an unformatted binary file that has no concept of lines. A text file is a formatted file with lines (of whatever length, separated by the `'\n'` character). You can get an idea of the file size from the product of the number of lines and the average line length from your random test. `2.` For extremely large files, shell tools can use a sliding-window approach, keeping X number of lines in a buffer and allowing you to scroll forwards/backwards, which gives the appearance that the whole file is in memory. – David C. Rankin May 28 '20 at 03:33
  • Hey David, thanks for calling that out. Removed 'raw data'; my question is focused entirely on lines in a text file and the calculations around that. "Raw data" might mislead towards the concept of a blob. Also, the question remains as to how data of this size is processed, and what the estimated size is. – ampathak May 28 '20 at 04:18
  • Without knowing the structure or format of the input, a good answer could be impossible. But let's assume the file is plain ASCII, has 80 chars per line at 1 byte per char, and there are 80x10^9 lines. The data is random and no compression is in place. A file of empty lines, just 1 byte for the line break at the end of each, would have 80x10^9 bytes, which is 80 GB. A file with 80 chars per line would be 80x80x10^9 bytes, which is 6400 GB, i.e. 6.4 TB. – U880D May 28 '20 at 04:37
  • I have the feeling that your question is more about IOPS and how to do calculations with them. In the past I've worked with systems that could process around 5 GB/s; 60 s x 5 GB/s = 300 GB/min. To read random data of 6.4 TB we parallelized it over 20 boxes. – U880D May 28 '20 at 04:47
  • @oguz, as per your feedback, I have added more details to the question. Let me know if more granularity is required to help in understanding the context. – ampathak May 28 '20 at 06:16
  • @U880D, thanks, the 6.4TB calculation makes sense. Also, for this parallelization over 20 boxes (each capable of processing 300GB/min of data), did you use `shuf` itself, or some other distributed algorithm to achieve the same across the different boxes? – ampathak May 28 '20 at 06:20
  • I didn't ask for useless links to code monkey discussions. Like, all tools you mentioned are open source ([shuf](https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/shuf.c?id=ae034822c535fa5), [sort](https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/sort.c?id=ae034822c535fa5), [bash](https://git.savannah.gnu.org/cgit/bash.git/tree/)), have you checked these to see what makes them so fast? Have you conducted any tests on your setup? Have you written any piece of code to make performance-wise comparisons with these tools? – oguz ismail May 28 '20 at 06:34
  • @picoder, thanks for providing additional information. In our systems the limitation was usually the disk subsystem or the network. The CPU cores, Linux system, tools, programs, algorithms and so on were capable of processing the data and were not the bottleneck. – U880D May 28 '20 at 06:43
  • @picoder, which tools to use will depend on the data structure, use case and so on. If you have data sets which are "in one line", you can split that set over multiple files. Furthermore, for distributed computing a distributed filesystem is usually in place, so multiple systems can read parts of the data set, which might also be split over multiple files. – U880D May 28 '20 at 06:46
  • @picoder, there are already books which cover questions like yours, e.g. [Data Science at the Command Line](https://www.datascienceatthecommandline.com/), or blog posts like [Command line tools can be .. faster ...](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html). – U880D May 28 '20 at 06:50
  • @picoder, according to your description I understand that you have a data set which is organized in lines of 100 chars/line. Your data set could be stored in one single file, but since there currently seems to be no need for that, it is recommended to split the data set into multiple files, e.g. 6400 files of 1 GB each (a sketch of such a split follows these comments). – U880D May 28 '20 at 08:15
  • @picoder, please also take note of [Bash shuffle a file that is too large to fit in memory](https://stackoverflow.com/questions/40814785/). – U880D May 28 '20 at 08:44
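
As a rough illustration of the split suggested in the comments (file name, chunk size, and suffix length are placeholders; `-C` keeps whole lines together):

    # cut bigfile.txt into pieces of at most 1 GB each without breaking lines;
    # -d -a 4 gives numeric suffixes chunk_0000 ... chunk_9999 (enough for ~6400 pieces)
    split -C 1G -d -a 4 bigfile.txt chunk_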

1 Answer


Not all commenters in the post that you reference agree about the performance. While one user reported extra-fast processing (78 billion lines in one minute), another commenter reported much slower results (1,000 rows from 500M rows in 13 minutes).

You can try the following (replace NNN with your favorite size):

    seq 1 NNN > 1 ; time shuf 1 > /dev/null

I'm getting:

  • For N=1,000,000: time = 0.2 sec
  • For N=10,000,000: time = 3.5 sec

Both are roughly in line with the 500M-rows-in-13-minutes report.

Note that the operation is CPU-bound for 10M lines. If the file size exceeds available memory, it will be slower.
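
For files that exceed RAM, one common workaround (a sketch only, independent of how `shuf` handles such input; the temp directory and buffer size are placeholders) is to prefix each line with a random key, let `sort` do an external merge sort on that key, and then strip the key again:

    awk 'BEGIN{srand()} {printf "%.17f\t%s\n", rand(), $0}' bigfile.txt |
      sort -t $'\t' -k1,1 -T /spare/disk -S 8G |
      cut -f 2- > shuffled.txt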

Bottom line: given that a 10 GB / 100M-line file already takes close to 10 minutes on your own setup, and a 78-billion-line file (several TB) would not even fit on a 500 GB SSD, the one-minute claim is most likely a measurement error.

dash-o
  • You can avoid the shuffle entirely by generating `N` random line numbers with `% total_lines` (you may have to manipulate the results or use another random source, since `$RANDOM` is limited to 16-bit values), then use `sed` or `awk` to compute the average length of the chosen lines. It will still take a while to pull out the random lines. (Worth a try to see if it saves time.) – David C. Rankin May 28 '20 at 04:58
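
A minimal sketch of that idea, assuming a reasonably recent GNU `shuf`, whose `-i` option can draw line numbers from a large range when combined with `-n` (sidestepping the `$RANDOM` limitation); the counts and file names are placeholders:

    K=1000                  # how many random lines to sample
    TOTAL=78000000000       # assumed total number of lines in bigfile.txt

    # draw K distinct random line numbers, then extract those lines
    # from the file in a single streaming awk pass
    shuf -i 1-"$TOTAL" -n "$K" | sort -n > picks.txt
    awk 'NR==FNR { want[$1]; next } FNR in want' picks.txt bigfile.txt > sample.txt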