1

First post and excited to be a part of this community.

I am a beginner and mainly use the command line for next-generation sequencing (NGS) analysis.

I have a list of files that contain data from a sequencer as follows:

[agh8423@quser12 all_fastq]$ ls Bio5* -al
-rw-r--r-- 1 agh8423 p30592 253029870 Jul 19 11:10 Bio5-H3K27ac-Dox-no_S5_L001_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 248177942 Jul 19 11:11 Bio5-H3K27ac-Dox-no_S5_L002_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 256860841 Jul 19 11:11 Bio5-H3K27ac-Dox-no_S5_L003_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 253399957 Jul 19 11:12 Bio5-H3K27ac-Dox-no_S5_L004_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 246636194 Jul 19 11:12 Bio5-H3K27ac-Dox-yes_S6_L001_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 242114964 Jul 19 11:13 Bio5-H3K27ac-Dox-yes_S6_L002_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 249862612 Jul 19 11:13 Bio5-H3K27ac-Dox-yes_S6_L003_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 247798281 Jul 19 11:14 Bio5-H3K27ac-Dox-yes_S6_L004_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 234917538 Jul 19 11:14 Bio5-H3K4me3-Dox-no_S3_L001_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 230571628 Jul 19 11:14 Bio5-H3K4me3-Dox-no_S3_L002_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 233025109 Jul 19 11:15 Bio5-H3K4me3-Dox-no_S3_L003_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 230268463 Jul 19 11:15 Bio5-H3K4me3-Dox-no_S3_L004_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 246254343 Jul 19 11:15 Bio5-H3K4me3-Dox-yes_S4_L001_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 241866406 Jul 19 11:16 Bio5-H3K4me3-Dox-yes_S4_L002_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 247044518 Jul 19 11:16 Bio5-H3K4me3-Dox-yes_S4_L003_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 243759599 Jul 19 11:17 Bio5-H3K4me3-Dox-yes_S4_L004_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 251009676 Jul 19 11:17 Bio5-Input-Dox-no_S1_L001_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 246054510 Jul 19 11:18 Bio5-Input-Dox-no_S1_L002_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 255798685 Jul 19 11:18 Bio5-Input-Dox-no_S1_L003_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 253896496 Jul 19 11:19 Bio5-Input-Dox-no_S1_L004_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 232179873 Jul 19 11:19 Bio5-Input-Dox-yes_S2_L001_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 227146014 Jul 19 11:19 Bio5-Input-Dox-yes_S2_L002_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 236543332 Jul 19 11:20 Bio5-Input-Dox-yes_S2_L003_R1_001.fastq.gz
-rw-r--r-- 1 agh8423 p30592 234698786 Jul 19 11:20 Bio5-Input-Dox-yes_S2_L004_R1_001.fastq.gz

If you notice, there are file names that are nearly identical except they differ in the "L001/2/3/4" portion of the file name. These are essentially replicate samples and for downstream processes I want to concatenate these files (but this information may not be relevant to my question)

WHAT I WANT: is to create a parent directory with the directory name being everything to the left of "_S(*)_L00(1/2/3/4)_Ri_001.fastq.gz" (so for example, the first file would have a directory named "Bio5-H3K27ac-Dox-no"). In addition to making this directory I want to put all of the files with the above file prefix (meaning all the L001/2/3/4 with the prefix name of Bio5-H3K27ac-Dox-no) into that new directory. The plan from there is to run zcat and concatenate the files into one file which would be easier to analyze.

Below is my attempt:

for file in ./*_L001_R1_001.fastq.gz.txt; do
    dir=${file%_L001_R1_001.fastq.gz.txt}
    mkdir -p "./$dir" &&
    mv -iv "$file" "./$dir"
    mv -iv "$dir"_L00* "./$dir"
done

And If I ls my directory I get the following.

[agh8423@quser11 test]$ ls -al
total 36
drwxrwsr-x 8 agh8423 p30592  4096 Jul 22 18:27 .
drwxrwsr-x 3 agh8423 p30592 32768 Jul 22 17:27 ..
drwxrwsr-x 2 agh8423 p30592  4096 Jul 22 18:27 Bio1-Input-Dox-no_S12
drwxrwsr-x 2 agh8423 p30592  4096 Jul 22 18:27 Bio1-Input-Dox-yes_S11
drwxrwsr-x 2 agh8423 p30592  4096 Jul 22 18:27 Bio1-MYC-Dox-no_S2
drwxrwsr-x 2 agh8423 p30592  4096 Jul 22 18:27 Bio1-MYC-Dox-yes_S3
drwxrwsr-x 2 agh8423 p30592  4096 Jul 22 18:27 Bio1-WDR5-Dox-no_S5
drwxrwsr-x 2 agh8423 p30592  4096 Jul 22 18:27 Bio1-WDR5-Dox-yes_S10
-rwxrwxr-x 1 agh8423 p30592   178 Jul 22 18:29 test1.sh

The part that I don't want is the _S12 etc. at the end of the directory name but I want it to remain in the file names that were moved to the new directories.

-Austin

  • Ok that sounds like fun, let us know how you do! But seriously, SO users will **help** you after you have worked on it, and are stuck on a problem that you could not find an answer for after doing your own research. For a script like this you need `mkdir`, `mv`, you could use `find`, do not parse `ls` output, ... – Nic3500 Jul 22 '18 at 21:19
  • Do not concatenate the files, it is not a good idea. Align them independently and put the proper read group in each file. Sort and merge the resulting bams. Analyze those bams that contain all your data for a single sample. – Poshi Jul 22 '18 at 22:40
  • Nic3500- Thanks for the advice for future posts. If you don't mind I'd love your feedback on my answer posted below. Thanks for pushing me to figure it out on my own. – gable_works Jul 22 '18 at 23:41
  • Poshi- Curent pipeline I have will align all four (L001/2/3/4) files and output the result into a single bam file. These L001/2/3/4 files are obtained from one sample but are more so technical replicates from one biological sample. Do you have reason that your method will offer anything different? Thanks! – gable_works Jul 22 '18 at 23:47
  • On one side, it makes more sense to keep different data (because that data is different, they were run in different lanes) easily identifiable, either by keeping them in separate files or by identifying each read individually in the bam file with the proper read group. – Poshi Jul 23 '18 at 07:55
  • On the other side, having this distintion makes it easy for you to stop different problems than can affect a single lane. One lane could be defective and you could be forced to discard that data. If you joined everything, then you cannot separate the faulty reads. Or, if you want to compute the duplicates ratio, fragments in different lanes could be considered as duplicates while they cannot physically be duplicates because they were physically separated. Many downstream analysis tools will rely on proper read group information, that's why they use it. – Poshi Jul 23 '18 at 07:57
  • BTW, nowadays are you working with a single end experiment? I though pair end experiments were really common these days :-? – Poshi Jul 23 '18 at 07:58
  • Poshi- I see what you mean. My exp is ChIP-seq so single end works well and the only advantage I know for paired-end ChIP-seq is a better approximation of the fragment lengths which could help with some peak-calling softwares. But I am new to NGS and may have a limited understanding. I agree that treating each lane of a flow cell independent could help downstream analysis. Thanks for your input! – gable_works Jul 23 '18 at 13:43

1 Answers1

3

Getting your proposal and refining it:

for file in ./*_L001_R1_001.fastq.gz.txt; do
    # $file will contain a relative folder and filename:
    # ./Bio5-H3K27ac-Dox-no_S5_L001_R1_001.fastq.gz
    # We are going to extract the filename and alter it to keep
    # the interesting part
    dir=$(basename "$file" | cut -f1 -d_)

    # Now, create the folder in the current workind directory
    mkdir -p "$dir"

    # Finally, move all the files that start with that
    # prefix to the new folder
    mv -iv "${dir}"* "$dir"
done

The last move command will throw an error because it will try to move $dir into $dir, which is not possible. But the other files will be moved and the job will be done. If you want a cleaner execution, you have to select the files you want to move (and exclude the folder, which you do not want to move):

find . -maxdepth 1 -type f -name "${dir}*" | xargs -n 1 -I{} mv {} "$dir"
tripleee
  • 175,061
  • 34
  • 275
  • 318
Poshi
  • 5,332
  • 3
  • 15
  • 32
  • 1
    You could use `mv -iv "$dir"?* "$dir"` to avoid the warning about moving a directory onto itself. The wildcard `?*` requires for there to be at least one character in the match. – tripleee Jul 23 '18 at 15:12
  • Errr... @tripleee, why did you added quotes to the vars? AFAIK, they are not needed. – Poshi Jul 23 '18 at 18:34
  • Variables containing file names should always be quoted. Scripts without quotes seem to work, then break horribly when you pass in a filename containing irregular whitespace, shell metacharacters, etc. This is a very common error and a source of many bugs. See https://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-shell-variable – tripleee Jul 23 '18 at 19:28
  • You can't predict who wants to copy/paste this code to a different scenario. Try http://shellcheck.net/ to diagnose this and many other problems, and follow the shell script tags on Stack Overflow for just a couple of days to see how common broken quoting bugs really are. – tripleee Jul 24 '18 at 06:51
  • Touchê! Moving to a new scenario without polishing for the new scenario could be potentially bad. I'll try to take care of these details in the future :-) – Poshi Jul 24 '18 at 07:21