I am trying to filter a text file based on a second file. The first file contains paragraphs like:
$ cat paragraphs.txt
# ::id 1
# ::snt what is an example of a 2-step garage album
(e / exemplify-01
:arg0 (a / amr-unknown)
:arg1 (a2 / album
:mod (g / garage)
:mod (s / step-01
:quant 2)))

# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01
:arg0 (a / amr-unknown)
:arg1 (a2 / album
:mod (p / person
:name (n / name
:op1 "abwe"))))
The second file contains a list of strings like this:
$ cat list.txt
# ::snt what is an example of a abwe album
# ::snt what is an example of a acid techno album
I now want to filter the first file and keep only the paragraphs whose snt line is contained in the second file. For the short example above, the output file would look like this (paragraphs separated by an empty line):
$ cat filtered.txt
# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01
:arg0 (a / amr-unknown)
:arg1 (a2 / album
:mod (p / person
:name (n / name
:op1 "abwe"))))
So I tried looping through the second file and using awk to print the matching paragraphs, but the check apparently does not work (all paragraphs are printed), and the resulting file contains the paragraphs multiple times. Also, the loop does not terminate. I tried this command:
while read line; do awk -v x=$line -v RS= '/x/' paragraphs.txt ; done < list.txt >> filtered.txt
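A likely reason every paragraph is printed: in awk, /x/ is a regex literal for the letter "x", not a reference to the variable x, and every paragraph here contains an "x" (in "example" and "exemplify"). The sketch below uses made-up two-line input to contrast the regex literal with index(), which compares against the variable's value. Separately, the unquoted $line is subject to word splitting, so -v x="$line" is the safer form, and >> appends to whatever an earlier run already wrote.

```shell
# Two input lines: only the second contains the letter "x".
input=$(printf 'abc\nxyz\n')

# /x/ is a regex literal for the character "x"; the awk variable x is ignored.
literal=$(printf '%s\n' "$input" | awk -v x='abc' '/x/')

# index($0, x) tests whether the VALUE of x occurs as a plain substring.
by_value=$(printf '%s\n' "$input" | awk -v x='abc' 'index($0, x)')

printf 'regex literal /x/ matched: %s\n' "$literal"
printf 'index($0, x) matched:      %s\n' "$by_value"
```

Even with that fixed, starting one awk process per list line would rescan paragraphs.txt once for each of the list entries, so a single awk invocation that loads the list into an array is probably the better direction.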
I also tried this plain awk script:
awk -v RS='\n\n' -v FS='\n' -v ORS='\n\n' 'NR==FNR{a[$1];next}{for(i in a)if(index($0,i)) print}' list.txt paragraphs.txt > filtered.txt
But it only uses the first line of list.txt.
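A plausible explanation: the -v settings apply to both input files, so list.txt is also read in paragraph-sized records. Since list.txt has no blank lines, the whole file becomes a single record, and with FS='\n' its $1 is just the first line. The sketch below reproduces that effect on a made-up three-line list. It assumes the portable paragraph mode RS= instead of RS='\n\n', because a multi-character RS is not defined by POSIX and behaves differently across awk implementations.

```shell
tmp=$(mktemp -d)
printf 'first\nsecond\nthird\n' > "$tmp/list.txt"

# Paragraph mode for the list: no blank lines, so the file is ONE record
# and $1 (with FS='\n') is only its first line.
whole=$(awk -v RS= -v FS='\n' '{ print NR ": " $1 }' "$tmp/list.txt")

# Default RS for the list: one record per line, so every entry is seen.
per_line=$(awk '{ print NR ": " $0 }' "$tmp/list.txt")

printf 'paragraph mode:\n%s\n\ndefault RS:\n%s\n' "$whole" "$per_line"
rm -r "$tmp"
```

Placing the RS assignment between the two file operands, rather than in -v, is one way to have it take effect only for the second file.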
Therefore, I need your help... :-)
UPDATE 1: from comments made by OP:
- ~526,000 entries in list.txt
- ~555,000 records in paragraphs.txt
- all lines of interest start with # ::sn (list.txt, paragraphs.txt)
- matching will always be performed against the 2nd line of a paragraph (paragraphs.txt)
UPDATE 2: after trying the solutions on the files described in the first update (timings are from the 4th run):
fastest command:
awk -F'\n' 'NR==FNR{list[$0]; next} $2 in list' list.txt RS= ORS='\n\n' paragraphs.txt
time: 8,71s user 0,35s system 99% cpu 9,114 total
second fastest command:
awk 'NR == FNR { a[$0]; next } /^$/ { if (snt in a) print rec; rec = snt = ""; next } /^# ::snt / { snt = $0 } { rec = rec $0 "\n" }' list.txt paragraphs.txt
time: 14,17s user 0,35s system 99% cpu 14,648 total
third fastest command:
awk 'FNR==NR { if (NF) a[$0]; next } /^$/ { if (keep_para) print para; keep_para=0; para=sep="" } $0 in a { keep_para=1 } { para=para $0 sep; sep=ORS } END { if (keep_para) print para }' list.txt paragraphs.txt
time: 15,33s user 0,35s system 99% cpu 15,745 total
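As a small check, the fastest command can be run against the two-paragraph example from the top of the question. The sketch below recreates both sample files in a temporary directory (the directory and the flat indentation are assumptions; the file contents are copied from the question). It relies on each list line being byte-identical to the second line of its paragraph, so trailing spaces or CRLF line endings would prevent a match.

```shell
tmp=$(mktemp -d)

cat > "$tmp/paragraphs.txt" <<'EOF'
# ::id 1
# ::snt what is an example of a 2-step garage album
(e / exemplify-01
:arg0 (a / amr-unknown)
:arg1 (a2 / album
:mod (g / garage)
:mod (s / step-01
:quant 2)))

# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01
:arg0 (a / amr-unknown)
:arg1 (a2 / album
:mod (p / person
:name (n / name
:op1 "abwe"))))
EOF

cat > "$tmp/list.txt" <<'EOF'
# ::snt what is an example of a abwe album
# ::snt what is an example of a acid techno album
EOF

# Pass 1 (default RS): every line of list.txt becomes an array key.
# Pass 2 (RS= set between the operands): paragraph mode; with -F'\n',
# $2 is the "# ::snt" line, which is looked up in the array.
awk -F'\n' 'NR==FNR { list[$0]; next } $2 in list' \
    "$tmp/list.txt" RS= ORS='\n\n' "$tmp/paragraphs.txt" > "$tmp/filtered.txt"

cat "$tmp/filtered.txt"
```

The array lookup is a hash access per paragraph rather than a scan over all list entries, which is consistent with this variant being the fastest of the three on the large files.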