I am trying to filter a text file based on a second file. The first file contains paragraphs like:

$ cat paragraphs.txt
# ::id 1
# ::snt what is an example of a 2-step garage album
(e / exemplify-01
      :arg0 (a / amr-unknown)
      :arg1 (a2 / album
            :mod (g / garage)
            :mod (s / step-01
                  :quant 2)))

# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01
      :arg0 (a / amr-unknown)
      :arg1 (a2 / album
            :mod (p / person
                  :name (n / name
                        :op1 "abwe"))))

The second file contains a list of strings like this:

$ cat list.txt
# ::snt what is an example of a abwe album
# ::snt what is an example of a acid techno album

I now want to filter the first file and keep only those paragraphs whose snt line is contained in the second file. For the short example above, the output file would look like this (paragraphs separated by an empty line):

$ cat filtered.txt
# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01
      :arg0 (a / amr-unknown)
      :arg1 (a2 / album
            :mod (p / person
                  :name (n / name
                        :op1 "abwe"))))

So I tried to loop through the second file and used awk to print out the paragraphs, but apparently the check does not work (all paragraphs are printed), and in the resulting file the paragraphs appear multiple times. Also, the loop does not terminate... I tried this command:

while read line; do awk -v x=$line -v RS= '/x/' paragraphs.txt ; done < list.txt >> filtered.txt

I also tried this plain awk script:

awk -v RS='\n\n' -v FS='\n' -v ORS='\n\n' 'NR==FNR{a[$1];next}{for(i in a)if(index($0,i)) print}' list.txt paragraphs.txt > filtered.txt

But it only uses the first line of the list.txt file.

Therefore, I need your help... :-)


UPDATE 1: from comments made by OP:

  • ~526,000 entries in list.txt
  • ~555,000 records in paragraphs.txt
  • all lines of interest start with # ::snt (list.txt, paragraphs.txt)
  • matching will always be performed against the 2nd line of a paragraph (paragraphs.txt)

UPDATE 2: after trying the solutions on the files described in the first update (4th-run timing):

fastest command:

awk -F'\n' 'NR==FNR{list[$0]; next} $2 in list' list.txt RS= ORS='\n\n' paragraphs.txt
time: 8,71s user 0,35s system 99% cpu 9,114 total

second fastest command:

awk 'NR == FNR { a[$0]; next }/^$/ { if (snt in a) print rec; rec = snt = ""; next }/^# ::snt / { snt = $0 }{ rec = rec $0 "\n" }' list.txt paragraphs.txt
time: 14,17s user 0,35s system 99% cpu 14,648 total

third fastest command:

awk 'FNR==NR { if (NF) a[$0]; next }/^$/    { if (keep_para) print para; keep_para=0; para=sep=""}$0 in a { keep_para=1 }{ para=para $0 sep; sep=ORS }END{ if (keep_para) print para }' list.txt paragraphs.txt
time: 15,33s user 0,35s system 99% cpu 15,745 total

3 Answers

Assumptions:

  • paragraphs in the paragraphs.txt file are separated by at least one blank line
  • matches are performed on entire lines
  • contents of lines are not known in advance (additional comments from OP negate this assumption)
  • entries from list.txt could appear anywhere in a paragraph (additional comments from OP negate this assumption)

A couple issues with the current code:

  • for the while/awk loop, try replacing /x/ with $0 ~ x; also make sure you wrap your bash variable reference in double quotes (i.e., -v x=$line should be -v x="$line"); though a single awk call is going to be more efficient (it only requires a single pass through each file).

  • for the 2nd awk script, -v RS='\n\n' -v FS='\n' -v ORS='\n\n' applies to both input files, so you won't be parsing list.txt correctly.
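Putting both of those fixes together, a corrected (but still slow) version of the shell loop might look like the sketch below; the sample files are a minimal stand-in for the ones in the question:

```shell
# Minimal sample data standing in for the question's files
cat > paragraphs.txt <<'EOF'
# ::id 1
# ::snt what is an example of a 2-step garage album
(e / exemplify-01)

# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01)
EOF
printf '# ::snt what is an example of a abwe album\n' > list.txt

# Quote the shell variable and use a string match; index($0, x)
# avoids surprises if a sentence contains regex metacharacters
# ($0 ~ x also works for plain text, as suggested above).
# NOTE: this still starts one awk process per line of list.txt,
# so a single-pass awk script remains far more efficient.
while IFS= read -r line; do
  awk -v x="$line" -v RS= -v ORS='\n\n' 'index($0, x)' paragraphs.txt
done < list.txt > filtered.txt

cat filtered.txt
```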

One awk idea:

awk '
FNR==NR { if (NF) a[$0]; next }             # if non-blank line then use entire line as array index
/^$/    { if (keep_para) print para         # blank line: if some part of current paragraph was found in a[] then print paragraph
          keep_para=0; para=sep=""          # reset variables
        }
$0 in a { keep_para=1 }                     # if current line found in a[] then set flag
        { para=para $0 sep; sep=ORS }       # save current line as part of current paragraph
END     { if (keep_para) print para }       # flush last paragraph to stdout?
' list.txt paragraphs.txt

NOTE: with the negation of some original assumptions this generalized approach will be less performant than other answers based on content specific to OP's particular data set

This generates:

# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01
      :arg0 (a / amr-unknown)
      :arg1 (a2 / album
            :mod (p / person
                  :name (n / name
                        :op1 "abwe"))))
  • Many thanks! It seems to work, but it is very slow. I have ~526,000 entries in the list file and ~555,000 records in the paragraphs file. The command has now been running for 9 hours; it is not finished yet, so I cannot check whether the result is correct. Are you sure each file is only passed once? – niccip Aug 07 '22 at 07:42
  • @niccip ahhhh, yeah, file size is going to have a major impact on how we write the `awk` script; the new answer should be quite a bit faster; my 1st answer took the (relatively) easy approach based on the assumption we were talking about relatively small files; while each file is only parsed once the 1st answer did require that we scan the `a[]` array for each paragraph, with a worst case scenario that we scan the entire array, so the net result is that we could scan the entire in-memory copy of `list.txt` for each paragraph – markp-fuso Aug 07 '22 at 13:32
  • generally speaking ... with relatively simple operations like we're talking about here an `awk` script shouldn't take more than a couple minutes to complete (assuming input files of 10's of GBytes); if the `awk` script runs longer than a few minutes then there's something wrong with the `awk` script (eg, in this case a poor design based on an assumption about the (small) size of the input files) – markp-fuso Aug 07 '22 at 13:40
  • In some awks at least, field splitting only occurs if necessary so using `NF` is going to force awk to do field splitting (for both input files I expect), just like if you referenced a field like `$1`, and so make it slower. I don't see any indication in the question that list.txt could contain blank lines so you could probably just remove `if (NF)`. I'd be interested to hear how much of a speedup that produces, if any, if you wouldn't mind 3rd-run timing that compared to your original (benefits probably are awk version dependent). – Ed Morton Aug 07 '22 at 14:08
  • The script of the second answer finished in a few seconds! Wow! Many thanks for your help and explanation. – niccip Aug 07 '22 at 14:19
  • @EdMorton testing `if (NF) a[$0]` vs `if ($1) a[$0]` vs `a[$0]` ... differences < `0.20` secs; `awk 5.1.1`, `cygwin` (in virtual Win10 client), avg of 10 runs of each test against same 500K simulated files – markp-fuso Aug 07 '22 at 14:19
  • @EdMorton I could try a third run. You mean, your answer but without the -F'\n' variable set? – niccip Aug 07 '22 at 14:21
  • @markp-fuso thanks for the timing info! In your sample, what percentage of records match across the 2 files? – Ed Morton Aug 07 '22 at 14:23
  • @niccip no, don't remove `-F'\n'`, just do exactly what it shows in [my answer](https://stackoverflow.com/a/73267996/1745001) and let us know the result, thanks. Make sure to run each script 3 times before taking the `time` output to remove any impact of cache-ing. – Ed Morton Aug 07 '22 at 14:24
  • @EdMorton running your code against my 500K simulated files is ~55% faster than my answer; obviously due to me testing each line of the paragraph instead of focusing on just the 2nd line of the paragraph; the 500K simulated files are just junk ... only get the single match as in OP's question – markp-fuso Aug 07 '22 at 14:28
  • Ah, I see. The OP would have to tell us of course but I'd think something like 50% matches might be more realistic. I actually wouldn't be surprised if the contents of list.txt were a subset of records in paragraphs.txt. – Ed Morton Aug 07 '22 at 14:32
  • Yes, the list.txt contains the sentences (snt) where there is a 'name' node in the AMR graph (which are contained in the paragraphs.txt). It is a subset of the sentences and graphs contained in the paragraphs.txt. I precalculated the list of those sentences in list.txt using a jupyter notebook and AMR library, as I needed to make sure 'name' is the label of a node and not something else in the graph. – niccip Aug 07 '22 at 15:06
  • If you just want to get the paragraphs from `paragraphs.txt` that contain a name node (presumably the `:name (n / name` line) you didn't have to create a separate list file, that'd be trivial to do with an awk script looking at just the paragraphs file. Ask a new question after accepting an answer to this one if you want help with that. – Ed Morton Aug 07 '22 at 15:44
  • @EdMorton unfortunately "name" nodes do not always look like the one in the example. Therefore, I chose to parse the AMRs and analyse the graphs for the desired name nodes. But now I have a rather quick solution for my problem. Thanks for the help! – niccip Aug 07 '22 at 17:39

You may try this:

awk '
    NR == FNR { a[$0]; next }
         /^$/ { if (snt in a) print rec; rec = snt = ""; next }
  /^# ::snt / { snt = $0 }
              { rec = rec $0 "\n" }
' list.txt paragraphs.txt

This assumes that records in paragraphs.txt are separated by empty lines and that the last record is also followed by an empty line.
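If the file might not end with an empty line, a hedged variant (sample data made up here) adds an END block to flush the final record:

```shell
# Made-up sample data; note the last record has no trailing blank line
printf '# ::snt what is an example of a abwe album\n' > list.txt
printf '# ::id 1\n# ::snt other sentence\n(x / x-01)\n\n# ::id 2\n# ::snt what is an example of a abwe album\n(y / y-01)' > paragraphs.txt

# Same logic as the answer above, plus an END block so the last
# record is printed even when no empty line follows it.
awk '
    NR == FNR { a[$0]; next }
         /^$/ { if (snt in a) print rec; rec = snt = ""; next }
  /^# ::snt / { snt = $0 }
              { rec = rec $0 "\n" }
  END         { if (snt in a) print rec }
' list.txt paragraphs.txt
```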


Using any awk:

$ awk -F'\n' 'NR==FNR{list[$0]; next} $2 in list' list.txt RS= ORS='\n\n' paragraphs.txt
# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01
      :arg0 (a / amr-unknown)
      :arg1 (a2 / album
            :mod (p / person
                  :name (n / name
                        :op1 "abwe"))))

I'm setting RS and ORS for the 2nd file only as that's the one we want to read/print using paragraph mode but I'm setting FS for all input files to additionally make reading of the first file a bit more efficient as awk then won't waste time splitting each line into fields.

The main problem with your awk script is you were setting RS and ORS for all input files instead of only setting them for the second one. Also note that RS='\n\n' requires a version of awk that supports multi-char RS while RS='' will work in any awk, see https://www.gnu.org/software/gawk/manual/gawk.html#Multiple-Line.
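To see the per-file effect, here is a minimal sketch (file names made up): `-F'\n'` applies to all input, while the `RS=` and `ORS` assignments placed between the file names only take effect for the file that follows them:

```shell
# lines.txt is read line by line; two_paras.txt is read in
# paragraph mode because RS= is assigned just before it.
printf 'x\n' > lines.txt
printf 'a\nb\n\nc\nd\n' > two_paras.txt

awk -F'\n' '{ print FILENAME, NR, NF }' lines.txt RS= ORS='\n\n' two_paras.txt
# lines.txt     -> 1 record  (NF=1)
# two_paras.txt -> 2 records (NF=2 each, fields split on newlines)
```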

Regarding the while read line; script in your question - see why-is-using-a-shell-loop-to-process-text-considered-bad-practice for the issues with doing that. Also, in regards to '/x/' see Example of testing the contents of a shell variable as a regexp: at How do I use shell variables in an awk script?.

  • time for this command on my data: 8,71s user 0,35s system 99% cpu 9,114 total – niccip Aug 07 '22 at 14:34
  • @niccip Was that 3rd-run timing? How does that compare to the 3rd-run timing for the other answers? Do they all produce the same output? It'd be very useful to us and others in future if you could edit your question to add the 3rd-run timing and output results (just pass/fail) for each of the answers to the end of it. – Ed Morton Aug 07 '22 at 14:35
  • time for the solution by M. Nejat Aydin on my data: 14,17s user 0,35s system 99% cpu 14,648 total – niccip Aug 07 '22 at 14:40
  • I see. I'm just curious - what is it you prefer about that slower solution that makes you tag it as your accepted answer? – Ed Morton Aug 07 '22 at 14:42
  • Yes, this was 3rd (actually 4th) run timing for your solution and the solution by M. Nejat Aydin. The solution by Mark has been running for 16 hours now... ;-) – niccip Aug 07 '22 at 14:46
  • Nothing, it was the one I was trying after Mark's solution... I will change my vote to yours! ;-) – niccip Aug 07 '22 at 14:48
  • I'm not asking you to change it, I just wondered... Maybe you should hold off on accepting any answer for a few hours to see if you get more of them since each answer you've gotten so far has been faster than the last and accepting an answer discourages others from posting answers. – Ed Morton Aug 07 '22 at 14:50
  • @niccip I would hope you've killed that 16+hr run of my first (now removed) answer; timing on my 2nd answer (making assumptions about input file formats) should be quite a bit faster – markp-fuso Aug 07 '22 at 14:54
  • @markp-fuso in the meantime I've killed it... ;-) and added the timing of your second answer – niccip Aug 07 '22 at 15:08
  • @EdMorton thanks for that advice. I am new (at least as an active user here) and was just so excited to get a solution for my problem so fast after trying to find a solution for days... – niccip Aug 07 '22 at 15:10
  • You're welcome. Just remember to come back later and accept one (and bear in mind fastest isn't always best - there's lots of other criteria like robustness, clarity, portability, memory usage, etc. that go into good software!) :-)! – Ed Morton Aug 07 '22 at 15:10