Hi all, I am looking for an efficient way to organize and filter certain types of text files.
Let's say I have 10,000,000 text files that are concatenated into larger chunks formatted like this:
@text_file_header
ID0001
some text
...
@text_file_header
ID0002
some text
...
@text_file_header
ID0003
some text
...
Now, I perform certain operations on those files so that I end up with 200 x 10,000,000 text files (in chunks); each text file now has "siblings":
@text_file_header
ID0001_1
some text
...
@text_file_header
ID0001_2
some text
...
@text_file_header
ID0001_3
some text
...
@text_file_header
ID0002_1
some text
...
@text_file_header
ID0002_2
some text
...
@text_file_header
ID0002_3
some text
...
However, for certain tasks I only need certain text files, and my main question is how I can extract them based on the "ID" in the text files (e.g., everything matching ID0001_*, ID0005_*, ID0006_*, and so on).
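To make this concrete, something along these lines is what I have in mind (just a sketch; the chunk and output file names are hypothetical). It filters record-wise on the ID line instead of doing a plain line-wise grep:

# keep only the records whose ID starts with ID0001_, ID0005_ or ID0006_
gawk -v RS="@text_file_header" '
    NF && $1 ~ /^(ID0001|ID0005|ID0006)_/ { printf "%s%s", RS, $0 }
' all_chunk_01.txt > selection.txt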
SQLite would be one option, and I already have an SQLite database with ID and file columns. However, the problem is that the computation that generates those 200 x 10,000,000 text files has to run on a cluster due to time constraints, and the file I/O for SQLite would be too limiting there.
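(For comparison, the read side would be simple in SQLite; assuming, hypothetically, a table named files with columns id and file, the lookup is just

sqlite3 texts.db "SELECT file FROM files WHERE id GLOB 'ID0001_*' OR id GLOB 'ID0005_*';"

so it is really only the write-heavy I/O on the cluster that rules it out for me.)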
My idea now is to split those chunks into 10,000,000 individual files, like so:
gawk -v RS="@text_file_header" 'NF{ fn = "file" ++n ".txt"; print RS $0 > fn; close(fn) }' all_chunk_01.txt
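A variant I am considering (this is an assumption on my side, not something I have set up yet) is to name each piece after the ID on the first line of its record instead of a running counter, so that the later selection by ID becomes a plain glob like ID0001_*.txt:

# $1 of each record is the ID line, e.g. ID0001
gawk -v RS="@text_file_header" 'NF{ fn = $1 ".txt"; print RS $0 > fn; close(fn) }' all_chunk_01.txt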
and after I have generated those 200 "siblings", I would do a cat in that folder based on the file IDs I am currently interested in. Let's say I need the corpus of 10,000 out of the 10,000,000 text files; I would cat them together into a single document for further processing steps.
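Assuming the individual files end up named after their IDs as in the variant above (again, hypothetical names), the assembly step for those 10,000 IDs could look like this, with ids.txt holding one base ID per line:

# concatenate all ~200 siblings of every wanted ID into one working document
while read -r id; do
    cat "${id}"_*.txt
done < ids.txt > corpus.txt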
Now, my concern is whether it is a good idea at all to store 10,000,000 individual files in a single folder on disk and run the cat there, or whether it would be better to grep the files out by ID from, let's say, 100 multi-record text files?
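If I go the second route, the sketch I have in mind (hypothetical file names again, and a lookup table instead of one huge regex, since the ID list can be long) is a single awk pass over all chunk files:

# wanted.txt: one base ID per line, e.g. ID0001
gawk '
    NR == FNR { wanted[$1]; next }                                     # first file: load wanted IDs
    NF { split($1, a, "_"); if (a[1] in wanted) printf "%s%s", RS, $0 }  # keep records whose base ID is wanted
' wanted.txt RS="@text_file_header" all_chunk_*.txt > corpus.txt

The trade-off, as far as I can see, is that the directory-of-files route avoids re-scanning the chunks for every selection, while the chunk-scanning route avoids keeping millions of small files in one folder.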