How to use awk for multiple file search in two directories, print records only from files with matching string in second directory

Question

I've reworked a previous question to make it clearer. I'm trying to search files in two directories and print matching character strings (plus the line immediately following each one) from the second directory into a new file, but only if they match a record in the first directory. I have found similar examples but nothing quite the same. I don't know how to use awk on multiple files from different directories, and I've tortured myself trying to figure it out.

Directory 1, 28,000 files, formatted viz.:

>ABC
KLSDFIOUWERMSDFLKSJDFKLSJDSFKGHGJSNDKMVMFHKSDJFS
>GHI
OOILKJSDFKJSDFLMOPIWERIOUEWIRWIOEHKJTSDGHLKSJDHGUIYIUSDVNSDG

Directory 2, 15 files, formatted viz.:

>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>DEF
12341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234

Desired output:

>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234

Directories 1 and 2 are located in my home directory: (./Test1 & ./Test2)

If anyone could advise on the command to specify the different directories, I'd be immensely grateful! Currently, when I include a file path (e.g., /Test1/*.fa), I get the following error:

awk: can't open file /Test1/*.fa

score 0 · Accepted Answer · edited May 23 '17 at 11:52


You'll want something like this (untested):

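awk '
# On the first line of every file, work out which directory it came from.
FNR==1 {
    dirname = FILENAME
    sub("/.*","",dirname)
    if (NR==1) {
        dirname1 = dirname      # directory of the very first file (Test1)
    }
}
# Files in the first directory: odd lines are headers, even lines are data.
dirname == dirname1 {
    if (FNR % 2) {
        key = $0
    }
    else {
        map[key] = $0
    }
    next
}
# Files in the second directory: remember each header line...
FNR % 2 {
    hdr = $0
    next
}
# ...and print it with its own data line if the header was seen in Test1.
(hdr in map) && !seen[hdr,$0]++ {
    print hdr ORS $0
}
' Test1/* Test2/*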
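
As for the `can't open file /Test1/*.fa` error in the question: the leading slash makes the path absolute, so the shell looks for Test1 under the filesystem root rather than in the current directory. When a glob matches nothing, POSIX-style shells (bash in its default configuration, for instance) pass the pattern through unchanged, so awk tries to open a file literally named `*.fa`. A minimal sketch of that behaviour, using a scratch directory and made-up file names that are assumptions for illustration only:

```shell
# Scratch directory with made-up names, purely for illustration
tmp=$(mktemp -d)
mkdir "$tmp/Test1"
printf '>ABC\nKLSDF\n' > "$tmp/Test1/a.fa"

# Relative path: the glob matches, so awk is handed a real file name
matched=$(cd "$tmp" && awk 'END { print NR " line(s) read" }' Test1/*.fa)

# No match: the shell leaves the pattern as literal text, and awk
# then tries (and fails) to open a file with that exact name
unmatched=$(cd "$tmp" && awk 'END { print NR }' NoSuchDir/*.fa 2>&1) || true

printf '%s\n%s\n' "$matched" "$unmatched"
```

The exact wording of the second message depends on which awk is installed, but it should mention the unexpanded pattern `NoSuchDir/*.fa`.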

Given that you're getting the error message /usr/bin/awk: Argument list too long, which means you're exceeding your shell's maximum argument length for a command, and that 28,000 of your files are in the Test1 directory, try this:

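find Test1 -type f -exec cat {} \; |
awk '
# Standard input (the concatenated Test1 files) is read first, so NR == FNR.
NR == FNR {
    if (FNR % 2) {
        key = $0
    }
    else {
        map[key] = $0
    }
    next
}
# Test2 files: remember each header line...
FNR % 2 {
    hdr = $0
    next
}
# ...and print it with its own data line if the header was seen in Test1.
(hdr in map) && !seen[hdr,$0]++ {
    print hdr ORS $0
}
' - Test2/*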
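
The pipe helps because the limit applies to the argument list handed to a single command: expanding Test1/* puts all 28,000 names on one awk command line, whereas find passes the names to cat itself. One possible refinement, which is not part of the answer above, is to end `-exec` with `+` instead of `\;` so that find packs many file names into each cat invocation rather than starting one cat per file. A small sketch, with a scratch directory and invented file names standing in for Test1:

```shell
# Scratch directory with a few made-up files standing in for Test1
tmp=$(mktemp -d)
mkdir "$tmp/Test1"
for i in 1 2 3; do
    printf '>KEY%s\nSEQ%s\n' "$i" "$i" > "$tmp/Test1/file$i.fa"
done

# Upper bound, in bytes, on the arguments plus environment of one command
getconf ARG_MAX

# One cat process per file, as in the answer
one_each=$(find "$tmp/Test1" -type f -exec cat {} \; | wc -l | tr -d ' ')

# Batched: find passes as many names as fit to each cat process
batched=$(find "$tmp/Test1" -type f -exec cat {} + | wc -l | tr -d ' ')

echo "lines via one cat per file: $one_each, via batched cat: $batched"
```

Both forms feed the same lines to the pipe; the batched form only reduces the number of processes started, which may matter with tens of thousands of files.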

answered May 26 '16 at 23:27 · Ed Morton · edited May 23 '17 at 11:52 · Community

So far seems to be working – have a reduced number of files for testing on my laptop – will check on the full dataset when I'm back in the office tomorrow and follow up – thank you so much! – MoGo May 27 '16 at 02:52
I have to say – this is marvelous! I have only encountered one hang-up, which is that my full dataset gives an error "/usr/bin/awk: Argument list too long". I tried to pipe it into `xargs` and I get the same error. I've already copied the >28,000 files into another directory to do it the slower way, but I wanted to ask if there might be another trick to get around this? Just to add another information resource for future efforts (this is a script I'll likely use more often). – MoGo May 27 '16 at 16:11
You're exceeding your shell's maximum argument length for a command, so you'll get that same error with any command (ls, cat, xargs, whatever). Let me think about it a bit to see if I can come up with a workaround (google results didn't produce anything useful in this particular case). – Ed Morton May 27 '16 at 19:14
OK, I added a possible solution; try it and let us know if it works for you. – Ed Morton May 27 '16 at 19:21

The find function works like a charm! I would have spent several days trying to figure this out, I really can't thank you enough! – MoGo May 27 '16 at 21:32

score 0 · Answer 2 · answered May 27 '16 at 02:52

Solution in TXR:

Data:

$ ls dir*
dir1:
file1  file2

dir2:
file1  file2

$ cat dir1/file1
>ABC
KLSDFIOUWERMSDFLKSJDFKLSJDSFKGHGJSNDKMVMFHKSDJFS
>GHI
OOILKJSDFKJSDFLMOPIWERIOUEWIRWIOEHKJTSDGHLKSJDHGUIYIUSDVNSDG

$ cat dir1/file2
>XYZ
SDOIWEUROIUOIWUEROIWUEROIWUEROIWUEROUIEIDIDIIDFIFI
>MNO
OOIWEPOIUWERHJSDHSDFJSHDF

$ cat dir2/file1
>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>DEF
12341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234

$ cat dir2/file2
>STP
12341234123412341234123412341234123412341234123412341234123412341234123412341234
>MNO
123412341234123412341234123412341234123412341234123412341234123412341234
$

Run:

$ txr filter.txr dir1/* dir2/*
>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234
>MNO
123412341234123412341234123412341234123412341234123412341234123412341234

Code in filter.txr:

@(bind want @(hash :equal-based))
@(next :args)
@(all)
@dir/@(skip)
@(and)
@  (repeat :gap 0)
@dir/@file
@    (next `@dir/@file`)
@    (repeat)
>@key
@      (do (set [want key] t))
@    (end)
@  (end)
@(end)
@(repeat)
@path
@  (next path)
@  (repeat)
>@key
@datum
@    (require [want key])
@    (output)
>@key
@datum
@    (end)
@  (end)
@(end)

To separate the dir1 paths from the rest, we use an @(all) match (which tries multiple pattern branches, all of which must match) with two branches. The first branch matches one @dir/@(skip) pattern, binding the variable dir to the text that precedes the slash and ignoring the rest. The second branch matches a whole consecutive sequence of @dir/@file patterns via @(repeat :gap 0). Because dir already has a binding from the first branch of the all, this constrains the matches to the same directory name. Inside this repeat we recurse into each file via next and gather the >-delimited keys into the want hash.

After that, we process the remaining arguments as path names of files to process; they don't all have to be in the same directory. We scan through each one for the >@key pattern followed by a line of @datum. The @(require ...) directive fails the match if key is not in the want hash; otherwise we fall through to the @(output).
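
For readers more at home in awk, the same two-phase idea (collect keys from dir1, then filter records from the remaining files) might be sketched as below. This is only a rough analogue, not a translation of the TXR code: the `phase` variable is a hypothetical name, set through awk's command-line assignments between file operands, and the data lines are shortened stand-ins for the sample files shown earlier.

```shell
# Recreate the sample layout in a scratch directory (data lines shortened)
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"
printf '>ABC\nKLSDF\n>GHI\nOOILK\n' > "$tmp/dir1/file1"
printf '>XYZ\nSDOIW\n>MNO\nOOIWE\n' > "$tmp/dir1/file2"
printf '>ABC\n1234\n>DEF\n12\n>GHI\n123412\n' > "$tmp/dir2/file1"
printf '>STP\n12341234\n>MNO\n123\n' > "$tmp/dir2/file2"

result=$(cd "$tmp" && awk '
    # Phase 1 (dir1 files): record every header line as a wanted key
    phase == 1 { if (/^>/) want[$0]; next }
    # Phase 2 (remaining files): a wanted header is printed and sets a flag
    /^>/       { keep = ($0 in want); if (keep) print; next }
    # The line right after a wanted header is printed, then the flag resets
    keep       { print; keep = 0 }
' phase=1 dir1/* phase=2 dir2/*)

printf '%s\n' "$result"
```

With the stand-in data this should print the >ABC, >GHI and >MNO records taken from dir2, matching the shape of the TXR run above.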

Thank you so much! Just heading off for the evening but will check this first thing in the morning from the office and follow up – I really appreciate your thorough explanation of the syntax. I have not used TXR before – something new to learn, and your well-annotated solution here will make it much easier to add to my repertoire! – MoGo May 27 '16 at 02:56
