I need something {bash?} that accomplishes the following, but much faster:
grep -w -f position.txt build37.txt > genetic.map
-w matches whole words only; otherwise 55550 would also match 17555508, 26155550, etc., which are out of order and not wanted. position.txt has 34,034 lines {numbers} in 1 column. build37.txt has 3,303,900 lines in 4 columns. The entire matching line is required, in the order in which the lines occur. When completed, genetic.map will have 34,034 lines in 4 columns.
EXAMPLES:
position.txt
{Line#1:} 14228077
build37.txt
{Line#12,644:} chr1 14228077 6.339762 29.633830
genetic.map
{Line#1:} chr1 14228077 6.339762 29.633830
Thank you!
-MORE-
build37.txt: {First few lines}
Chromosome Position(bp) Rate(cM/Mb) Map(cM)
chr1 55550 2.981822 0.000000
chr1 82571 2.082414 0.080572
chr1 88169 2.081358 0.092229
chr1 254996 3.354927 0.439456
chr1 564598 2.887498 1.478148
chr1 564621 2.885864 1.478214
chr1 565433 2.883892 1.480558
chr1 568322 2.887570 1.488889
chr1 568527 2.895420 1.489481
chr1 721290 2.655176 1.931794
chr1 723819 2.669992 1.938509
chr1 728242 2.671779 1.950319
chr1 729948 2.675202 1.954877
position.txt: {contrived as example}
82571
564621
565433
721290
genetic.map {desired}
chr1 82571 2.082414 0.080572
chr1 564621 2.885864 1.478214
chr1 565433 2.883892 1.480558
chr1 721290 2.655176 1.931794
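One possible approach to the single-column lookup above, as a minimal sketch that has not been timed on the full data: load the positions into an awk array, then print each line of the map file whose second field is a key in that array. This makes one pass over each file instead of testing every pattern against every line. Because it compares the whole second field, it also cannot match a number that merely appears inside another column (grep -w treats the digits after a decimal point, as in 6.339762, as a separate word). The sample_* file names are hypothetical stand-ins so the demo does not overwrite real files; the rows are copied from the examples above.

```shell
# Hypothetical sample files; rows copied from the examples in this post.
cat > sample_build37.txt <<'EOF'
Chromosome Position(bp) Rate(cM/Mb) Map(cM)
chr1 55550 2.981822 0.000000
chr1 82571 2.082414 0.080572
chr1 88169 2.081358 0.092229
chr1 564598 2.887498 1.478148
chr1 564621 2.885864 1.478214
chr1 565433 2.883892 1.480558
chr1 721290 2.655176 1.931794
EOF

cat > sample_position.txt <<'EOF'
82571
564621
565433
721290
EOF

# First file (NR == FNR): remember every position as an array key.
# Second file: print the line when its second field is one of those keys.
awk 'NR == FNR { want[$1]; next } $2 in want' \
    sample_position.txt sample_build37.txt > sample_genetic.map

cat sample_genetic.map
```

On the real files the same command would read position.txt and build37.txt and write genetic.map. If either file was saved with Windows line endings, the trailing carriage return becomes part of the last field and would need to be stripped first.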
My apologies! There are 569 duplicates within the position column {column two} of build37.txt, so I would need two identifiers {chromosome and position} in order to obtain the correct lines, e.g.:
chr1 123456
chr6 123456
I have tried all of the solutions suggested ... Perhaps because I was wrong about my reference data, which is better queried using TWO fields rather than ONE, the results were 357-569 lines longer than expected.
I moved my project to Windows {XP} and had better results with:
findstr /g:chr.pos.txt build37.txt > genetic.map
The results were 44 lines longer than expected {better anyway}.
FINDSTR: /C ignored; /L made no difference; /R might be more exact but processed slowly, at 71 lines per minute into genetic.map.
There is a discussion of poorly documented findstr features at: "What are the undocumented features and limitations of the Windows FINDSTR command?"
chr.pos.txt:
chr1 14228077
chr1 14228490
...
chr22 49783510
chr22 49784152
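For the two-identifier lookup, a minimal awk sketch, assuming both files are whitespace-separated with the chromosome in column one and the position in column two: it keys the array on both fields and compares them exactly, so a pair such as chr1 1422 can never match a longer position on the same chromosome the way a substring search can, which may be where the extra lines came from. The sample_* file names are hypothetical, and the two chr6 rows are invented purely to show a duplicated position; they are not real map values.

```shell
# Hypothetical sample files. The chr1 rows are from this post;
# the chr6 rows are made up to illustrate duplicated positions.
cat > sample_build37.txt <<'EOF'
Chromosome Position(bp) Rate(cM/Mb) Map(cM)
chr1 564598 2.887498 1.478148
chr1 564621 2.885864 1.478214
chr1 565433 2.883892 1.480558
chr6 564621 1.000000 2.000000
chr6 565433 1.500000 2.500000
EOF

cat > sample_chr.pos.txt <<'EOF'
chr1 564621
chr6 565433
EOF

# First file: store each (chromosome, position) pair as a combined key.
# Second file: print the line only when both of its first two fields match a stored pair.
awk 'NR == FNR { want[$1, $2]; next } ($1, $2) in want' \
    sample_chr.pos.txt sample_build37.txt > sample_genetic.map

cat sample_genetic.map
```

On the real data the inputs would be chr.pos.txt and build37.txt, with the output redirected to genetic.map. Lines come out in the order they occur in build37.txt, and a pair listed in chr.pos.txt but absent from build37.txt is silently skipped.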