0

I have two files (recode and reads) that were built and saved with nano command and I want to compare what has on recode to reads and extract the lines in reads that overlaps. I have been trying to create a when loop with the previous logic on mind, but without success so far. The output data is not matching with the pattern specified in the loop while with grep/recode. The script was supposed to read each line in recode.txt compare to reads.fastq, extract each match line plus one line before and 2 after in the reads.txt and save the output in different files (for all combined match lines per line of the recode.txt). Here are the tables and code:

File recode.txt:

GTGTCTTA+ATCACGAC
GTGTCTTA+ACAGTGGT
GTGTCTTA+CAGATCCA
GTGTCTTA+ACAAACGG
GTGTCTTA+ACCCAGCA
GTGTCTTA+AACCCCTC
GTGTCTTA+CCCAACCT
ATCACGAC+AAGGTTCA
GTGTCTTA+GAAACCCA

File reads.fastq:

###################################
@NB500931:113:HW53WBGX2:1:11101:11338:1049 1:N:0:ATCACGAC+AAGGTTCA
GTAGTNCCAGCTGCAGAGCTGGAAGGATCGCTTGAGCGCAGAGGTAGAGGCTACAGTGAGCCGTGATCATGCCAT
+
AAAAA#EAAEEEEE6EAEAEEEEEEEEEEEEEEEAEEEEEE/EEEEEEEEEE/EEEEEEEEEEEEEEEAEEEEEA
@NB500931:113:HW53WBGX2:1:11101:6116:1049 1:N:0:ACAAACGG+AAGGTTCA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
###################################
@NB500931:113:HW53WBGX2:1:11101:6885:1049 1:N:0:ACCCAGCA+ACTTAGCA
GAGGGNGCTGTCCCAGTAATTGGGTTCAGATGACATTTGCTTGATTTTAGGGATGTACGAGATTTTCGTGGATC
+
AAA/A#EAEEEEEAEAEEA///EEAEEEEE///AEEAEE/AA//EAA<EEE/E//AEEEAAA//E/A<6//EEA
@NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
+

Script:

#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=96:00:00

while read line
do
echo "working on $line"
grep -A3 "$line" reads.fastq | grep -v "^--$" >> "$line"_sorted.fastq
done<recode.txt

So, both files are in UNIX format and the following script (without a loop) works smooth

According to the script without the looping:

grep -A3 "ATCACGAC+AAGGTTCA" reads.fastq | grep -v "^--$" > sorted_file.fastq

my output should be:

            @NB500931:113:HW53WBGX2:1:11101:11338:1049 1:N:0:ATCACGAC+AAGGTTCA
   GTAGTNCCAGCTGCAGAGCTGGAAGGATCGCTTGAGCGCAGAGGTAGAGGCTACAGTGAGCCGTGATCATGCCAT
            +

    @NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
    CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
            +

However, my output using the loop while give me a empty file with the correct name. Can you please help me?

UPDATE: I have tried dos2unix to convert my files and it didn't work. UPDATE: I edited the question to include my expected output

BeGentle
  • 7
  • 3
  • The `-l` option to `grep` makes no sense when it's reading from a pipe rather than filenames. – Barmar Aug 11 '17 at 23:13
  • Thanks for your reply. I've used without it and even so didn't work. Any other thoughts? – BeGentle Aug 11 '17 at 23:22
  • Why do you need the `grep -v`? There are no `--` lines in the file. What actual problem are you having? – Barmar Aug 11 '17 at 23:24
  • Basically my issue is that the output is generated but I have nothing inside the files. However, it works perfectly when I place the PATTERN directly,(as demonstrated in the second script). The -v and -- are to avoid the -- (at least that is how I was told) – BeGentle Aug 11 '17 at 23:27
  • You need to use `grep -E` to get extended regular expressions, which you need for the `+` quantifier in your patterns. – Barmar Aug 11 '17 at 23:33
  • None of your gene patterns in `recode.txt` appear in the sample of `reads.fastq` – Barmar Aug 11 '17 at 23:36
  • Are you trying to match the `+` literally, or use it as a regexp quantifier? – Barmar Aug 11 '17 at 23:36
  • When I add `CCCAACCT+ACTTAGCA` to `recode.txt`, I get output in `CCCAACCT+ACTTAGCA_sorted.fastq`. – Barmar Aug 11 '17 at 23:38
  • Thanks, but it didn't work... – BeGentle Aug 11 '17 at 23:39
  • What didn't work? I just tried your script as it's written, and it worked for me when I added a matching line to `recode.txt`. – Barmar Aug 11 '17 at 23:40
  • I need to search for the NNNNNNN+NNNNNN as a word that must match with the NNNNNNN+NNNNNN after the N:0: of each sequence `################################### @NB500931:113:HW53WBGX2:1:11101:6885:1049 1:N:0:**ACCCAGCA+ACTTAGCA** GAGGGNGCTGTCCCAGTAATTGGGTTCAGATGACATTTGCTTGATTTTAGGGATGTACGAGATTTTCGTGGATC + AAA/A#EAEEEEEAEAEEA///EEAEEEEE///AEEAEE/AA//EAA – BeGentle Aug 11 '17 at 23:44
  • But have you tried with the loop `while`? – BeGentle Aug 11 '17 at 23:46
  • I just copied your script from the question and ran it. – Barmar Aug 11 '17 at 23:49
  • That is weird... because only works for me if I don't use the loop (in other words), just the second script works – BeGentle Aug 11 '17 at 23:51
  • I tried with the changes you just made, and I got a non-empty `ATCACGAC+AAGGTTCA_sorted.fastq` – Barmar Aug 11 '17 at 23:52
  • Did you create `recoded.txt` on Windows? Use `dos2unix` to fix the newlines. – Barmar Aug 11 '17 at 23:53
  • I created in excel from a mac OSX, but I have checked for unix format (it has $ at the end of each line) – BeGentle Aug 11 '17 at 23:58
  • What do you mean by `$` at the end? Do a hex dump and see if the line breaks are `0a` (Unix) or `0d0a` (Windows). – Barmar Aug 12 '17 at 00:04
  • On OS X, you can remove the CR characters with `tr -d '\r' recode.txt > recode.txt.fixed` – Barmar Aug 12 '17 at 00:06
  • Thanks! It is almost working now! there are some entries in the output, but they don't correspond to the list in recoded – BeGentle Aug 12 '17 at 01:00
  • 1
    @Barmar The `--` lines are from the `grep -A` output to separate line groups. – Benjamin W. Aug 12 '17 at 01:49
  • 2
    Your input has fields. grep does not support operations on fields so therefore any solution using grep will be an approximation of what you want and more complicated than it should be. awk, on the other hand, DOES support operations on fields. An awk solution will be far clearer, simpler, more efficient, and better in every other way than a shell loop with chains of greps in it. [edit] your solution to show the expected output given your posted sample input and someone will provide a robust, trivial awk script to do whatever it is you're trying to do. – Ed Morton Aug 12 '17 at 13:50
  • 1
    I would love to see this question improved and answered properly, but @BeGentle, you need to [edit your question](https://stackoverflow.com/posts/45644898/edit) and add clarification that has shown up in comments. Make an [MCVE](http://stackoverflow.com/help/mcve) so that people trying to answer can test their suggestions on your data and get the output you've said you're looking for. – ghoti Aug 12 '17 at 14:03

1 Answers1

1

Without seeing the expected output it's a guess but it sounds like this is what you're trying to do:

$ awk -F: 'NR==FNR{a[$0];next} $NF in a{c=3} c&&c--' recode.txt reads.fastq
@NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
+

No shell loop required (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice for SOME of the reasons why that matters), just saves the values from recode.txt as array indices and then when reading reads.fastq if the last :-separated field is an index of the array (i.e. existed in recode.txt) then set a counter to 3 and then print every line while the counter is greater than zero, decrementing the counter each time (see printing-with-sed-or-awk-a-line-following-a-matching-pattern for other examples of printing text starting from a match).

To save each found record in a file based on the string name in that final field as it looks like you might be trying to do in your shell loop would be:

awk -F: '
    NR==FNR  { a[$0]; next }
    $NF in a { c=3; close(out); out=$NF"_sorted.fastq" }
    c&&c--   { print >> out }
' recode.txt reads.fastq

Note that that just reads "reads.fastq" once total, not once per line of "recode.txt" as your shell loop was doing, so you can expect a vast performance improvement from that aspect alone.

Finally - if recode.txt is just a list of ALL of the final fields that exist in reads.fastq then you simply don't need it, this is all you need to split reads.fastq into separate files of 3 lines per record named based on the value after the last : on each line that starts with @:

awk -F: '
    /^@/   { c=3; close(out); out=$NF"_sorted.fastq" }
    c&&c-- { print >> out }
' reads.fastq
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • Hi @Ed Morton, many thanks for your reply. I have edited my question as you suggested and my aim would be exactly what you wrote in your second script. I ran it and I gave me no outputs .. – BeGentle Aug 12 '17 at 16:36
  • Your question shows input where every line that starts with `@` starts at the beginning of a line and output where inexplicably every line that starts with `@` is indented by some number of spaces. You say you want the matching line + the succeeding 2 lines (you tell us `grep -A3` works) but then your posted expected output includes 1 line before the matching line. I **show** in my answer running the script against your posted input and it producing output yet you say it produces no output so your posted input must not match your real input. Idk how we're supposed to help you given all of that – Ed Morton Aug 12 '17 at 17:20
  • Hi, my working/real files have the same composition of the ones here - it is just way more big. Apart from that there is no difference. I have copied the files from here to a nano file and then I ran your script. It works well. However, once I do the same for my working files, I have no output. I have used dos2unix already and really don't know what else could be. Do you have any ideas? Thanks again – BeGentle Aug 12 '17 at 18:30
  • I have fixed the expected output – BeGentle Aug 12 '17 at 18:48
  • Neither the grep script you posted nor the awk script I posted,both of which you say work for you, would produce the expected output you posted from the input you posted so between the inconsistencies in your question and not seeing any sample data that has the failures you're asking for help to debug - sorry but no, I have no ideas other than suggesting you clean up your question and work on reducing your real data set to a minimal example that can be used to reproduce the problem and then posting that. – Ed Morton Aug 12 '17 at 19:10
  • Hi @Ed Morton, I have improved the output and the code.. could you help to select the one line before and 2 after the match lines? – BeGentle Aug 16 '17 at 16:22
  • 1
    Sorry, I'm new here (in fact, that was my first question ever here). So, I wasn't familiar about this situation and the good practices here. Sorry about it. I have edited it. Apart from it, I have included your 'awk' script and it is working. Many thanks – BeGentle Aug 16 '17 at 18:16