Hi all, I am looking for an efficient way to organize and filter certain types of text files.
Let's say I have 10,000,000 text files that are concatenated into larger chunks formatted like this:
@text_file_header
ID0001
some text
...
@text_file_header
ID0002
some text
...
@text_file_header
ID0003
some text
...
Now, I perform certain operations on those files so that I end up with 200 x 10,000,000 text files (in chunks); each text file now has "siblings":
@text_file_header
ID0001_1
some text
...
@text_file_header
ID0001_2
some text
...
@text_file_header
ID0001_3
some text
...
@text_file_header
ID0002_1
some text
...
@text_file_header
ID0002_2
some text
...
@text_file_header
ID0002_3
some text
...
However, for certain tasks I only need certain text files, and my main question is how I can extract them based on the "ID" in the text files (e.g., everything matching ID0001_*, ID0005_*, ID0006_*, and so on).
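To make this concrete, something along these lines is what I have in mind (just a sketch; the chunk and output file names are hypothetical). It filters record-wise on the ID line instead of doing a plain line-wise grep:

# keep only the records whose ID starts with ID0001_, ID0005_ or ID0006_
gawk -v RS="@text_file_header" '
    NF && $1 ~ /^(ID0001|ID0005|ID0006)_/ { printf "%s%s", RS, $0 }
' all_chunk_01.txt > selection.txt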
SQLite would be one option, and I already have an SQLite database with ID and file columns. However, the problem is that the computation that generates those 200 x 10,000,000 text files has to run on a cluster due to time constraints, and the file I/O for SQLite would be too limiting there.
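(For comparison, the read side would be simple in SQLite; assuming, hypothetically, a table named files with columns id and file, the lookup is just

sqlite3 texts.db "SELECT file FROM files WHERE id GLOB 'ID0001_*' OR id GLOB 'ID0005_*';"

so it is really only the write-heavy I/O on the cluster that rules it out for me.)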
My idea now is to split those chunks into 10,000,000 individual files, like so:
gawk -v RS="@text_file_header" 'NF{ fn = "file" ++n ".txt"; print RS $0 > fn; close(fn) }' all_chunk_01.txt
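A variant I am considering (this is an assumption on my side, not something I have set up yet) is to name each piece after the ID on the first line of its record instead of a running counter, so that the later selection by ID becomes a plain glob like ID0001_*.txt:

# $1 of each record is the ID line, e.g. ID0001
gawk -v RS="@text_file_header" 'NF{ fn = $1 ".txt"; print RS $0 > fn; close(fn) }' all_chunk_01.txt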
and after I have generated those 200 "siblings", I would do a cat in that folder based on the file IDs I am currently interested in. Let's say I need the corpus of 10,000 out of the 10,000,000 text files; I would cat them together into a single document for further processing steps.
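Assuming the individual files end up named after their IDs as in the variant above (again, hypothetical names), the assembly step for those 10,000 IDs could look like this, with ids.txt holding one base ID per line:

# concatenate all ~200 siblings of every wanted ID into one working document
while read -r id; do
    cat "${id}"_*.txt
done < ids.txt > corpus.txt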
Now, my concern is whether it is a good idea at all to store 10,000,000 individual files in a single folder on disk and run the cat there, or whether it would be better to grep the files out by ID from, let's say, 100 multi-record text files?
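If I go the second route, the sketch I have in mind (hypothetical file names again, and a lookup table instead of one huge regex, since the ID list can be long) is a single awk pass over all chunk files:

# wanted.txt: one base ID per line, e.g. ID0001
gawk '
    NR == FNR { wanted[$1]; next }                                     # first file: load wanted IDs
    NF { split($1, a, "_"); if (a[1] in wanted) printf "%s%s", RS, $0 }  # keep records whose base ID is wanted
' wanted.txt RS="@text_file_header" all_chunk_*.txt > corpus.txt

The trade-off, as far as I can see, is that the directory-of-files route avoids re-scanning the chunks for every selection, while the chunk-scanning route avoids keeping millions of small files in one folder.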