List full lines based on single field much faster than grep

Question

I need something {bash?} to accomplish the following but much faster

grep -w -f position.txt build37.txt > genetic.map

-w whole words only -otherwise 55550 would include 17555508, 26155550 etc out of order, or not wanted; position.txt has 34,034 lines {numbers} in 1 column; build37.txt has 3,303,900 lines in 4 columns; the entire line is required in the order they occur. genetic.map when completed will have 34,034 lines in 4 columns

EXAMPLES:

position.txt
{Line#1:} 14228077

build37.txt
{Line#12,644:} chr1 14228077    6.339762    29.633830

genetic.map
{Line#1:} chr1  14228077    6.339762    29.633830

Thank you!

-MORE-

build37.txt: {First few lines}

Chromosome  Position(bp)    Rate(cM/Mb) Map(cM)
chr1    55550   2.981822    0.000000
chr1    82571   2.082414    0.080572
chr1    88169   2.081358    0.092229
chr1    254996  3.354927    0.439456
chr1    564598  2.887498    1.478148
chr1    564621  2.885864    1.478214
chr1    565433  2.883892    1.480558
chr1    568322  2.887570    1.488889
chr1    568527  2.895420    1.489481
chr1    721290  2.655176    1.931794
chr1    723819  2.669992    1.938509
chr1    728242  2.671779    1.950319
chr1    729948  2.675202    1.954877

positions.txt: {contrived as example}

genetic.map {desired}

chr1    82571   2.082414    0.080572
chr1    564621  2.885864    1.478214
chr1    565433  2.883892    1.480558
chr1    721290  2.655176    1.931794

My apologies! There are 569 duplicates within the position column {number two} of build37.txt. I would need two identifiers: In order to obtain the correct lines.

chr1  123456
chr6  123456

I have tried all of the solutions suggested ... Perhaps because I was wrong about my reference data which is better queried using TWO fields rather than ONE, the results were 357-569 lines longer than asked-for and expected

I moved my project to windows {XP} and had better results with:

findstr /g:chr.pos.txt build37.txt > genetic.map

The results were 44-lines longer than asked-for and expected {better anyway}
FINDSTR: /C ignored /L made no difference /R might be more exact but processed slowly @ 71-lines per minute in > genetic.map

A discussion of poorly documented findstr features at: What are the undocumented features and limitations of the Windows FINDSTR command?

chr.pos.txt:

chr1    14228077
chr1    14228490  
...
chr22   49783510
chr22   49784152

Good that you have shown your attempted code. Could you please show sample of input and expected output in your question and let us know then. — RavinderSingh13, Feb 29 '20 at 17:13
Don't use grep for this as grep doesn't know anything about fields and so you'll get false matches when, for example, you're looking for `675202` in the "position" field but even with `grep -wF` that'd match `2.675202` or similar in the Rate or Map fields or any other part of the line since `.` isn't a word constituent character so `2.675202` is considered 2 words by the grep engine - `2`, and `675202` — Ed Morton, Feb 29 '20 at 21:03
I understood that the format of the file `position.txt` changed from one to two fields. After a deeper analysis, you changed the problem. So, you should start a new question. Please provide enhough lines of `position.txt` and `build37.txt` to check the solutions we provide match your desired output `genetic.map`. — Pierre François, Mar 03 '20 at 11:07

Pierre François · Answer 1 · 2020-02-29T17:40:18.380

3

The solution I proposed above with fgrep will not make a big difference. It is better to use the join tool, if it is OK to sort the files position.txt and build37.txt.

join -1 1 -2 2 <(sort -k 1 position.txt) <(sort -k 2 build37.txt) | awk '{print $2, $1, $3, $4}'

It could be possible to test this solution if you could provide a little subset of the files position.txt and build37.txt.

edited Feb 29 '20 at 17:40

answered Feb 29 '20 at 17:33

Pierre François

5,850
1
17
38

1

Instead of `| awk '{print $2, $1, $3, $4}'` just do `join -o 1.1,1.2,1.3,1.4`. If `position.txt` would have a header like `build37.txt`, we could also use `join --header`. – KamilCuk Feb 29 '20 at 20:58

score 1 · Answer 2 · answered Feb 29 '20 at 17:15

1

Try:

fgrep -w -f position.txt build37.txt > genetic.map

fgrep is faster than grep when the pattern you are matching is not a regular expression but a fixed string, as in the example you provided where you search the string 14228077.

answered Feb 29 '20 at 17:15

Pierre François

5,850
1
17
38

I think GNU grep, when given a pattern that has no regex special characters, defaults to `grep -F` behaviour. – Benjamin W. Feb 29 '20 at 17:41
Yes, fgrep works VERY fast! grep was doing 21 lines a minute on my system, fgrep did 33,000 lines in a minute. I have 569 additional lines, presumably as the position number was inadvertently pulling from other columns. I would post compressed files if that were possible, I don't know that a random subset would be useful as there is a 97:1 correspondence between the two files. I don't think the overage is fatal, but an exact solution would probably involve file manipulation that I am unfamiliar with, thank you for assistance. – user3193737 Feb 29 '20 at 18:15
The join | awk solution missed one line but added 357 un-needed lines, all of them out of order. I suppose the matching item from position.txt should only "look" in column two of build37.txt , thank you for assistance. – user3193737 Feb 29 '20 at 18:36
If you want to comment the `join ... | awk ...` solution, I advise you to write your comments under that solution for the sake of clarity. The "out of order" issue will probably be related with a problem of sorting, but with the only entry you provide as input, I cannot figure out why the sort didn't work completely. For fixing that, we need more information about the format of the input than only one entry. – Pierre François Feb 29 '20 at 19:43
1

`fgrep` is deprecated, use `grep -F` instead. Don't do it all though for this problem because you'll get false matches since it'll be looking for `1234` across the whole line so if `7.1234` appeared in some other column than the "position" one then it'd be a match since `.` isn't a word-constituent character. – Ed Morton Feb 29 '20 at 20:53
Thank you everybody. I have amended my question above with my latest efforts. I have not yet found a perfect solution. – user3193737 Mar 02 '20 at 12:47
In the examples of input and output you provide, please foresee the case of the same *Position* value (in field 2 of `build37.txt`) corresponding to different *Chromosome* values (in field 1 of `build37.txt`). – Pierre François Mar 02 '20 at 13:42
In the Reference map Build37.txt there can only be ONE chr1 POSITION 12345; but there can be a chr1 POSITION 123456 and chr1 POSITION 1234567 which I assume could be printed into the > file. Any findstr modifier for "exact" is either ignored or will add unacceptable time to the query. – user3193737 Mar 02 '20 at 14:04
@user3193737 [the answer I provided](https://stackoverflow.com/a/60469524/1745001) will produce exactly the output you provided from the 2 input files you provided and best I can tell will continue to work for the modified input with duplicates that you describe. So - in what way is that not doing what you want? Assuming it isn't working for you somehow update your question to provide sample input/output that the script I posted fails for. – Ed Morton Mar 02 '20 at 14:25
awk solution: using the original single-field position.txt extracts 563 additional un-needed lines; if I substitute the two-field chr.pos.txt with the tab between the two fields as exists in the ref file I get an empty file. I suspect the code would have to be updated to "pull" from both fields, no? – user3193737 Mar 02 '20 at 14:57
Open a new question about two fields in `position.txt` and I guess you will get a bunch of solutions. – Pierre François Mar 04 '20 at 09:48

score 1 · Answer 3 · answered Feb 29 '20 at 20:52

1

You should be much more worried about accuracy than efficiency when trying to use grep to match on a single field since grep doesn't have any concept of "fields". Just use awk:

$ awk 'NR==FNR{pos[$1]; next} $2 in pos' position.txt build37.txt
chr1    82571   2.082414    0.080572
chr1    564621  2.885864    1.478214
chr1    565433  2.883892    1.480558
chr1    721290  2.655176    1.931794

That'll be fast and robust since it's doing a hash lookup using exactly/only the strings that appear in the positions column of build37.txt against exactly/only the contents of position.txt.

answered Feb 29 '20 at 20:52

Ed Morton

188,023
17
78
185

Thank you... My apologies! There are 569 duplicates within the position column {number two} of build37.txt. I would need two identifiers: In order to obtain the correct lines. – user3193737 Feb 29 '20 at 22:02
Did you try what I posted? It doesn't care about duplicates and best I can tell should do exactly what you want. – Ed Morton Mar 01 '20 at 02:35
@EdMorton: Can you give more explanation about what your **awk** command exactly does. I don't find in the manual what is happening with two file names as argument of the command. – Pierre François Mar 02 '20 at 13:56
awk is reading both input files. When it reads the first one the condition NR==FNR is true (man page tells you what they mean) and awk populates the array `pos[]`. When it reads the 2nd one that condition is false soit tests to see if the current $2 from the 2nd file is present in `a[]` and if so prints the current line from the 2nd file. It's an **extremely** common awk idiom so good to know. – Ed Morton Mar 02 '20 at 14:21
@EdMorton: I still do not understand what `$2 in pos` does. Sorry, but I don't find it in the docs of awk. – Pierre François Mar 03 '20 at 11:24
@PierreFrançois It tests whether or not `$2` is an index `in` the array `pos`. See https://www.gnu.org/software/gawk/manual/gawk.html#Reference-to-Elements and https://www.gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array for information on the `in` operator. Actually, it'd probably be good if you just read that whole section on arrays - https://www.gnu.org/software/gawk/manual/gawk.html#Arrays – Ed Morton Mar 03 '20 at 15:50
1

@EdMorton: thank you, my knowledge of **awk** is not far above the average and I didn't link `in` with arrays in such a context, but I am interested to know more about it. Now I understand that your line of code include 2 instructions where the default action of the second is `print $0`. I also understand why you need a `next` instruction, to prevent from executing the second instruction as long as you read the first input file. Tough... – Pierre François Mar 04 '20 at 09:46

List full lines based on single field much faster than grep

3 Answers3