I have a GWAS summary estimate file with the following columns (file 1):

1   chr1_1726_G_A      0.023  0.160
1   chr1_20184_GAATA_G 0.033  0.180
1   chr1_791101_T_TGG  0.099  0.170

file 2:

chr1_20184_GAATA_G
chr1_791101_T_TGG

I would like to match column 1 of file 2 against column 2 of file 1 to create a file 3 like this:

1   chr1_20184_GAATA_G 0.033  0.180
1   chr1_791101_T_TGG  0.099  0.170

Using the code below, I get an empty file3:

awk 'FNR==NR{arr[$2];next} (($2) in arr)' file2 file1 > file3
AVA
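
A one-liner makes the root cause visible before reading further (this is exactly what the comments and the answer below point out): file2 has only one field per line, so `$2` is empty while reading it, the array only ever gets the empty string as a key, and no line of file1 can match.

awk '{print NF}' file2
1
1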
  • Why not `fgrep`? It seems to me that it's the dedicated tool for this kind of filter. – Zilog80 Jun 17 '21 at 09:37
  • @Zilog80 using `grep` for field-based processing is tricky. Even assuming that content like `chr1_2018` can occur only in the second field, you'd still need to add conditions to avoid partial matches (for example, `chr1_2018` shouldn't match a field containing `chr1_2018a`). And if the number of entries in file 2 is huge, `grep` will be slower than hash-based matching in tools like `awk`. – Sundeep Jun 17 '21 at 10:08
  • @Zilog80 `fgrep` is deprecated in favor of `grep -F`, and it's more cumbersome to use any *grep on specific fields than it is to just use awk, since grep is line-oriented while awk is record-and-field oriented. That becomes especially true when you want to do a string match, because grep has to use a regexp to isolate the field, so you'd have to escape every regexp metachar in the text you're using for the match to try to get it to act as if it were a string. Very kludgy and messy. – Ed Morton Jun 17 '21 at 12:42
  • @AVA count how many fields are present on each line in `file2`. It's just 1 field, right? Now look at your code that reads file2 (`FNR==NR{arr[$2]...`). It's trying to use field 2 of a file that has 1 field per line. Think about that. – Ed Morton Jun 17 '21 at 12:46
  • @EdMorton @Sundeep, noted. I'm an old one, I see that (funny, I _know_ the `-F` grep option but still think of `fgrep`; old habits are the hardest to break). The provided data in the question and the provided `awk` script let me see the matching field as some kind of unique key. I'm fond of `awk`, and I see your point regarding line versus record. In this case, as the source is a simple delimited file, the `grep -F` way doesn't seem so inappropriate. But I agree that for anything other than a one-shot script, it's better to rely on `awk` for reliability and efficiency. – Zilog80 Jun 17 '21 at 13:32
  • @Zilog80 if you'd like to post an answer using `grep -F`, we can point out the specific issues. It's one of those things that only **sounds** simple until you actually try to do it, and then you end up having to consider and code around various issues. It's just not a useful approach. – Ed Morton Jun 17 '21 at 14:10
  • @EdMorton If it's flawed, then a comment should be enough. I would have gone with `sed "s/^/\t/;s/$/\t/" file2.txt | grep -F -f - file1.txt`, as it is apparently a tab-delimited file. It's not very efficient performance-wise with large files, I agree, and Unicode keys may cause some trouble. And if there are metachars in the keys you may need to escape them; however, the OP's use case doesn't seem to be subject to that. – Zilog80 Jun 17 '21 at 16:16
  • For reference, the [file format specification](https://github.com/MRCIEU/gwas-vcf-specification) is tab-delimited: _"There are 8 fixed fields per record. All data lines are tab-delimited. In all cases, missing values are specified with a dot (‘.’)."_ The OP's file1 is apparently a subset with 4 fields. – Zilog80 Jun 17 '21 at 16:29
  • @Zilog80 obviously you're not just using `grep -F` with that, you're using `sed` + a pipe + `grep -F`. It looks to me like the OP's example is fixed-width fields, though, and I googled the gwas file format and found [A GWAS file is a space- or tab-delimited result file](https://software.broadinstitute.org/software/igv/GWAS), so YMMV with assuming it's tab-separated. It'd obviously fail if some of the strings from file2 could appear as other fields in file1, but that doesn't seem likely given the OP's sample input. Your sed command could just be `sed 's/.*/\t&\t/'`, by the way. – Ed Morton Jun 17 '21 at 16:43
  • @EdMorton Nice compact way for the `sed` ^^. Do you think that's worth an answer? Maybe the OP has some concerns about the file, so it should wait for their feedback. – Zilog80 Jun 17 '21 at 17:04
  • @Zilog80 thanks. No, IMHO the OP should just use a single call to awk for this or anything else involving fields. In fact I think it's true to say that any time you're considering using sed + grep you'll get a better (for some definition of "better") script if you either just use sed or just use awk, depending on the task at hand. – Ed Morton Jun 17 '21 at 17:31
  • Thank you for the above discussion. Using `sed 's/.*/\t&\t/' file2.txt | grep -F -f - file1.txt` did the job quite quickly. – AVA Jun 22 '21 at 08:28
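
For reference, that pipeline from the comments above, written out. It assumes file1 really is tab-delimited (see the format specification linked in the comments; the OP confirmed it works) and a sed that understands `\t` in the replacement, such as GNU sed; adjust the delimiter otherwise.

sed 's/.*/\t&\t/' file2 | grep -F -f - file1 > file3

Wrapping each file2 entry in literal tabs means `grep -F` only matches it as a complete interior field of file1, so a key like `chr1_2018` cannot partially match `chr1_2018a`.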

1 Answer


With your shown samples, please try the following awk code.

awk 'FNR==NR{arr[$0];next} ($2 in arr)' file2 file1

OR

awk 'FNR==NR{arr[$1];next} ($2 in arr)' file2 file1

Explanation: file2 has only one field per line, so use `$0` (first solution) or `$1` (second solution) as the array key instead of `$2` while reading file2; the rest of your code, which checks whether `$2` of each file1 record is in the array, is fine as-is.
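
A quick check with the samples shown in the question (redirecting to file3 as the OP does):

awk 'FNR==NR{arr[$0];next} ($2 in arr)' file2 file1 > file3
cat file3
1   chr1_20184_GAATA_G 0.033  0.180
1   chr1_791101_T_TGG  0.099  0.170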

RavinderSingh13