
I have two files (only the beginning of each is shown):

patterns.txt

m64071_201130_104452/13
m64071_201130_104452/26
m64071_201130_104452/46
m64071_201130_104452/49
m64071_201130_104452/113
m64071_201130_104452/147

myfile.txt

>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
>m64071_201130_104452/26/ccs
TAGACAATGTA

The expected output is:

>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA

I want to write to a new file every record in myfile.txt whose header matches a line in patterns.txt, keeping the sequence line (the ACTG letters) that follows each matching header. I currently use:

for i in $(cat patterns.txt); do 
     grep -A 1 $i myfile.txt; done > my_newfile.txt

It works, but creating the new file is very slow. The files are fairly large, though not huge (14 MB for patterns.txt and 700 MB for myfile.txt).

I also tried grep -v, because I have another file containing the IDs from myfile.txt that are not in patterns.txt, but it is just as slow.

Is there a faster way to do this?

RavinderSingh13
Paillou
  • Please add your desired output (no description) for that sample input to your question (no comment). – Cyrus Feb 18 '21 at 16:42
  • When you say `14M for patterns.txt`, do you mean there are 14 million lines in the file, or that the file size is 14 MB? – markp-fuso Feb 18 '21 at 17:07
  • It is in bytes. The comment below answered my question. – Paillou Feb 18 '21 at 17:10
  • @tripleee, with all due respect, I would like to raise this here: IMHO the added link can provide the logic for doing this (up to some extent), but this isn't an exact duplicate of that question (since the OP's data is different from the data shown in the linked question), so I have reopened the question now, thank you. – RavinderSingh13 Feb 18 '21 at 17:28
  • @tripleee, IMHO, duplicate means "exact duplicate", NOT one which only gives a little guidance. Otherwise it doesn't make sense to mark anything as a dupe if it isn't an exact dupe. I am also in favor of marking a question as a dupe, but only when it is exactly a dupe. I feel I made my point politely before re-opening this one, thank you. – RavinderSingh13 Feb 18 '21 at 17:41
  • To quote https://meta.stackexchange.com/a/10844/169312: *"Questions may be duplicates if they have the same (potential) answers. This includes not only word-for-word duplicates, but also the same idea expressed in different words.*" – tripleee Feb 18 '21 at 17:43
  • @tripleee, I didn't mean a word-for-word duplicate (of course we all understand that; we are looking at the logic here, not comparing line by line). The logic is also different between the linked question and this one. If you compare the questions and answers there and here, I think I need not explain it further, thank you. – RavinderSingh13 Feb 18 '21 at 17:45

2 Answers


With your shown samples, please try the following. Written and tested in GNU awk.

awk '
FNR==NR{
  arr[$0]
  next
}
/^>/{
  found=0
  match($0,/.*\//)
  if((substr($0,RSTART+1,RLENGTH-2)) in arr){
    print
    found=1
  }
  next
}
found
'  patterns.txt myfile.txt

Explanation: a detailed, line-by-line explanation of the above.

awk '                         ## Start the awk program.
FNR==NR{                      ## True only while the first file, patterns.txt, is being read.
  arr[$0]                     ## Store the current line as an index of array arr.
  next                        ## Skip the remaining statements and read the next line.
}
/^>/{                         ## For lines of myfile.txt that start with >, do the following.
  found=0                     ## Reset found.
  match($0,/.*\//)            ## Match everything up to and including the last / in the line.
  if((substr($0,RSTART+1,RLENGTH-2)) in arr){  ## If the matched text, minus the leading > and trailing /, is an index of arr:
    print                     ## Print the header line.
    found=1                   ## Set found to 1.
  }
  next                        ## Skip the remaining statements and read the next line.
}
found                         ## Print the line (the sequence) if found is set.
'  patterns.txt myfile.txt    ## Input file names.
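
One point that may be worth checking beyond speed: the array lookup above compares the whole ID, whereas `grep` with the pattern `m64071_201130_104452/13` would also match a header such as `/130/ccs`. The following is a minimal sketch for trying this out, assuming a scratch directory and a shortened copy of the question's sample data. The `/130/ccs` record is made up for illustration and does not come from the question.

```shell
#!/bin/sh
# Work in a throwaway directory (hypothetical location, created by mktemp).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A few IDs taken from the sample patterns.txt in the question.
printf '%s\n' \
  'm64071_201130_104452/13' \
  'm64071_201130_104452/26' \
  'm64071_201130_104452/46' > patterns.txt

# Sample records; the /130/ccs record is invented to show the substring pitfall.
printf '%s\n' \
  '>m64071_201130_104452/13/ccs'  'ACAGTCGAGCG' \
  '>m64071_201130_104452/16/ccs'  'ACAGTCGAGCG' \
  '>m64071_201130_104452/130/ccs' 'CAGTCGAGCGC' \
  '>m64071_201130_104452/26/ccs'  'TAGACAATGTA' > myfile.txt

# Same logic as the answer: exact lookup of the ID in an array.
awk '
FNR==NR { arr[$0]; next }                 # load IDs from patterns.txt
/^>/ {
  found=0
  match($0,/.*\//)                        # everything up to the last /
  if (substr($0,RSTART+1,RLENGTH-2) in arr) { print; found=1 }
  next
}
found                                     # sequence line of a matched header
' patterns.txt myfile.txt > my_newfile.txt

cat my_newfile.txt                        # only the /13/ and /26/ records

# Substring matching, as in the original grep loop, also hits /130/ccs.
grep -c 'm64071_201130_104452/13' myfile.txt   # prints 2
```

With this sample, `my_newfile.txt` should contain only the `/13/ccs` and `/26/ccs` records, while the `grep -c` line counts two matching headers for a single pattern.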
RavinderSingh13

Another awk:

$ awk -F/ '                            # split fields on /
NR==FNR {
    a[$1,$2]                           # store each pattern as a hash key
    next
}
{
    if( tf=((substr($1,2),$2) in a) )  # tf is 1 if the header (minus >) is in the hash
        print                          # output the matching header
    if((getline) > 0 && tf)            # read the next record; if the header matched
        print                          # output the sequence too
}' patterns.txt myfile.txt

Output:

>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA

Edit: To output the ones not found:

$ awk -F/ '                              # split fields on /
NR==FNR {
    a[$1,$2]                             # store each pattern as a hash key
    next
}
{
    if( (substr($1,2),$2) in a ) {       # if the header (minus >) is in the hash
        getline                          # consume its sequence line too
        next
    }
    print                                # otherwise output
}' patterns.txt myfile.txt

Output:

>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
James Brown
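
Since the question mentions needing both the matching records and the remaining ones, the two commands above could presumably be combined into a single pass over myfile.txt. The following is only a sketch of that idea, not something taken from either answer: the output names `matched.txt` and `unmatched.txt` are hypothetical, the sample data is recreated in a scratch directory, and it assumes the first line of myfile.txt is a header.

```shell
#!/bin/sh
# Work in a throwaway directory (hypothetical location, created by mktemp).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data as shown in the question.
printf '%s\n' \
  'm64071_201130_104452/13' \
  'm64071_201130_104452/26' \
  'm64071_201130_104452/46' > patterns.txt

printf '%s\n' \
  '>m64071_201130_104452/13/ccs' 'ACAGTCGAGCG' \
  '>m64071_201130_104452/16/ccs' 'ACAGTCGAGCG' \
  '>m64071_201130_104452/20/ccs' 'CAGTCGAGCGC' \
  '>m64071_201130_104452/22/ccs' 'CACACATCTCG' \
  '>m64071_201130_104452/26/ccs' 'TAGACAATGTA' > myfile.txt

# hit and miss are assumed output file names, passed in as awk variables.
awk -F/ -v hit=matched.txt -v miss=unmatched.txt '
NR==FNR { a[$1,$2]; next }                               # load IDs from patterns.txt
/^>/    { out = ((substr($1,2),$2) in a) ? hit : miss }  # choose the file at each header
        { print > out }                                  # header and its sequence go together
' patterns.txt myfile.txt

cat matched.txt
cat unmatched.txt
```

Because the destination is chosen only at header lines, every sequence line follows its header into the same file, and myfile.txt is read once instead of twice.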