
All, I am looking for an efficient way to organize and filter certain types of text files.

Let's say I have 10,000,000 text files that are concatenated into larger chunks formatted like this:

@text_file_header
ID0001
some text
...
@text_file_header
ID0002
some text
...
@text_file_header
ID0003
some text
...

Now, I perform certain operations on those files so that I end up with 200 x 10,000,000 text files (in chunks); each text file now has "siblings":

@text_file_header
ID0001_1
some text
...
@text_file_header
ID0001_2
some text
...
@text_file_header
ID0001_3
some text
...
@text_file_header
ID0002_1
some text
...
@text_file_header
ID0002_2
some text
...
@text_file_header
ID0002_3
some text
...

However, for certain tasks I only need certain text files, and my main question is how I can extract them based on the ID in the text files (e.g., everything belonging to ID0001_*, ID0005_*, ID0006_*, and so on).
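
To make that concrete, here is a rough gawk sketch of the kind of selection I mean, working directly on one concatenated chunk; the three ID prefixes in the regex are just placeholders for whatever set I happen to need at the time:

gawk -v RS="@text_file_header" '
    NR > 1 && $1 ~ /^(ID0001|ID0005|ID0006)_/ {   # $1 is the ID line right after the header
        printf "%s%s", RS, $0                     # re-print the header plus the whole record
    }
' all_chunk_01.txt > subset.txt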

SQLite would be one option, and I also already have an SQLite database with ID and file columns. However, the problem is that I need to run the computation that generates those 200 * 10,000,000 text files on a cluster due to time constraints, and the file I/O for SQLite would be too limiting right now.

My idea was now to split those chunks into 10,000,000 individual files, like so:

gawk -v RS="@text_file_header" 'NF{ print RS $0 > ("file" ++n ".txt") }' all_chunk_01.txt

and after I have generated those 200 "siblings", I would cat together, in that folder, the files with the IDs I am currently interested in. Let's say I need the contents of 10,000 out of the 10,000,000 text files; I would cat them into a single document for the further processing steps. Now, my concern is whether it is a good idea at all to store 10,000,000 individual files in a single folder on a disk and perform the cat, or whether it would be better to grep the records out of, let's say, 100 multi-record chunk files based on their IDs (roughly as sketched below).
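
For the second option (keeping, say, 100 big chunk files and pulling records out of them), something like the following sketch is what I have in mind; wanted_ids.txt is a hypothetical file with one wanted ID per line, e.g. ID0001_1:

gawk '
    NR == FNR { wanted[$1]; next }                                    # first file: load the list of wanted IDs
    /^@text_file_header/ { hdr = $0; expect_id = 1; keep = 0; next }  # record boundary: remember header, reset flags
    expect_id { expect_id = 0; keep = ($1 in wanted); if (keep) print hdr }
    keep                                                              # prints the ID line and the record body
' wanted_ids.txt all_chunk_*.txt > subset.txt

That way only the ~100 chunk files would ever be opened, instead of touching 10,000,000 individual files.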

  • efficient == database. inefficient = 10,000,000 text files. – stark Feb 20 '15 at 18:21
  • You don't grep and cat to get those chunks, you use something like awk or perl or, as stark points out, use the source database itself. Since you mention file I/O, the I/O on 10 million files is going to be _far_ greater than the I/O on the database. – Stephen P Feb 20 '15 at 20:22
  • The goal is to eventually read them into a database. I mean, I already have the 10,000,000 text files in a database, but not the generated 200-siblings. What I meant by I/O: I would have to read them into the database, index the database, filter, and write the filtered targets to the hard drive again. This will be done eventually, but right now, I am a little bit time-constrained –  Feb 20 '15 at 21:03
  • Can you use awk like http://stackoverflow.com/questions/23934486/is-a-start-end-range-expression-ever-useful-in-awk showed? – Walter A Feb 20 '15 at 21:21

1 Answer


For example:

grep TextToFind FileWhereToFind

returns what you want.

Ósky F J
  • Yes, sure, but how do I extract certain chunks from the file? E.g., grepping for "ID0001_1 and all the following lines until the next ID" + "ID0123_9 and all the following lines until the next ID" –  Feb 20 '15 at 18:57
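
    (For illustration, a rough awk sketch of that kind of extraction, assuming the record layout from the question; the two IDs in the pattern are just the examples from this comment:)

    awk '
        /^@text_file_header/ { hdr = $0; keep = 0; next }   # record boundary: remember header, reset flag
        /^(ID0001_1|ID0123_9)$/ { keep = 1; print hdr }     # wanted ID: re-emit the stored header
        keep                                                # print the ID line and the lines that follow
    ' all_chunk_01.txt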