Questions tagged [fastq]

FASTQ files are used in bioinformatics to store sequence information and sequencing quality scores.

FASTQ format is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores. The sequence letter and quality score are each encoded with a single ASCII character for brevity.

[Wikipedia]
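
To make the description above concrete, here is a minimal Python sketch that reads four-line FASTQ records and decodes the quality string into numeric scores. It assumes unwrapped records (exactly four lines each) and the Phred+33 ASCII offset; neither is stated in the tag description, and older files may use a different offset. The names `read_fastq` and `phred_scores` and the sample record are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            # End of input (a blank line also stops the reader in this sketch).
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        # Drop the leading '@' from the header line.
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode one ASCII character per base into an integer score (assumed Phred+33)."""
    return [ord(char) - offset for char in quality]


# Hypothetical one-record example, not taken from any question on this page.
example = io.StringIO("@read1\nACGT\n+\nI5!#\n")
for read_id, sequence, quality in read_fastq(example):
    print(read_id, sequence, phred_scores(quality))  # read1 ACGT [40, 20, 0, 2]
```

Each quality character maps to exactly one base, so the sequence and quality strings of a record always have the same length.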

257 questions
23
votes
4 answers

faster membership testing in python than set()

I have to check for the presence of millions of elements (20-30 letter strings) in a list containing 10-100k of those elements. Is there a faster way of doing that in Python than set()? import sys #load ids ids = set( x.strip() for x in open(idfile) ) for…
Leszek
  • 1,290
  • 2
  • 11
  • 21
19
votes
14 answers

Converting FASTQ to FASTA with SED/AWK

I have data that always comes in blocks of four lines in the following format (called FASTQ): @SRR018006.2016 GA2:6:1:20:650 length=36 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN +SRR018006.2016 GA2:6:1:20:650…
neversaint
  • 60,904
  • 137
  • 310
  • 477
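
The question above asks for sed/awk; purely as an illustration of the four-line block structure it describes, a Python sketch of the same conversion might look like this. It assumes every record is exactly four lines and that the input is already split into lines without trailing newlines; the function name `fastq_to_fasta` is hypothetical.

```python
def fastq_to_fasta(lines):
    """Convert a list of FASTQ lines (four per record) into FASTA lines."""
    fasta = []
    # Step through the input one complete four-line record at a time.
    for start in range(0, len(lines) - 3, 4):
        header, sequence = lines[start], lines[start + 1]
        # FASTA headers start with '>' instead of '@'; the '+' and quality lines are dropped.
        fasta.append(">" + header[1:])
        fasta.append(sequence)
    return fasta


# Hypothetical record for illustration only.
print("\n".join(fastq_to_fasta(["@read1 length=4", "ACGT", "+read1 length=4", "IIII"])))
```

Only the first two lines of each block survive the conversion, which is why line-oriented tools can do the same job by line number alone.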
12
votes
2 answers

How do I use parallel programming/multi threading in my bash script?

This is my script: #!/bin/bash #script to loop through directories to merge fastq files sourcedir=/path/to/source destdir=/path/to/dest for f in $sourcedir/* do fbase=$(basename "$f") echo "Inside $fbase" zcat $f/*R1*.fastq.gz |…
Komal Rathi
  • 4,164
  • 13
  • 60
  • 98
10
votes
4 answers

bash: /bin/ls: Argument list too long

I need to make a list of a large number of files (40,000 files) like below: ERR001268_1_100.fastq ERR001268_2_156.fastq ERR001753_2_78.fastq ERR001268_1_101.fastq ERR001268_2_157.fastq ERR001753_2_79.fastq ERR001268_1_102.fastq …
LookIntoEast
  • 8,048
  • 18
  • 64
  • 92
9
votes
3 answers

Read list of files on unix and run command

I am pretty new at shell scripting and I have been struggling all day to figure out how to perform a "for" command. Essentially, what I am trying to do is the following: I have a list.txt file with a bunch of names: name1 name2 name3 for every name…
user2647734
  • 127
  • 1
  • 1
  • 5
6
votes
2 answers

regex: matching several patterns derived from a simple string

I have the following task: starting with a 30-character pattern sequence (it is actually a DNA sequence, let's call it P30), I need to find in a text file all lines starting (^agacatacag... ) with an exact P30, then with the last 29 characters of the 30, 28…
darked89
  • 332
  • 1
  • 2
  • 17
6
votes
4 answers

How can I make my Python script faster?

I'm pretty new to Python, and I have written a (probably very ugly) script that is supposed to randomly select a subset of sequences from a fastq-file. A fastq-file stores information in blocks of four rows each. The first row in each block starts…
Sandra
  • 71
  • 3
5
votes
0 answers

Large files for GitHub CICD

I have a GitHub repo of a pipeline that requires very large files as input (basic test datasets would be around 1-2 Gb). I thought about circumventing this by doing CICD locally, but this will not allow the CICD to run if other people want to…
4
votes
3 answers

Filter sequences with more than 8 same consecutive nucleotides in a fastq file?

I want to filter the sequences that have more than 8 identical consecutive nucleotides, like "GGGGGGGG", "CCCCCCCC", etc., in my fastq files. How should I do that?
Dawud
  • 51
  • 3
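
One possible approach to the question above, sketched in Python with a backreference regex. The question says "more than 8" but its examples show runs of exactly 8, so the threshold is left as a parameter: `min_run=8` here drops reads containing a run of 8 or more identical bases, which is an assumption. The function names are hypothetical, and records are assumed to be exactly four lines each.

```python
import re


def has_long_run(sequence, min_run=8):
    """Return True if the sequence contains min_run or more identical consecutive bases."""
    # ([ACGTN]) captures one base; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"([ACGTN])\1{%d,}" % (min_run - 1), re.IGNORECASE)
    return pattern.search(sequence) is not None


def filter_fastq_lines(lines, min_run=8):
    """Keep only the four-line records whose sequence has no long homopolymer run."""
    kept = []
    for start in range(0, len(lines) - 3, 4):
        record = lines[start:start + 4]
        if not has_long_run(record[1], min_run):
            kept.extend(record)
    return kept
```

Passing `min_run=9` would match the stricter reading, where only runs longer than 8 are removed.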
4
votes
4 answers

Map files into memory

I will explain what my problem is first, as it's important to understand what I want :-). I'm working on a pipeline written in Python that uses several external tools to perform several genomics data analyses. One of these tools works with very huge…
guillemch
  • 323
  • 2
  • 14
3
votes
3 answers

Grep that tolerates mismatches to subset .fastq

I am working with bash on a linux cluster. I am trying to extract reads from a .fastq file if they contain a match to a queried sequence. Below is an example .fastq file containing three reads. $ cat example.fastq @SRR1111111.1…
Paul
  • 656
  • 1
  • 8
  • 23
3
votes
1 answer

Renaming interleaved fastq headers with biopython

For ease of use and compatibility with another downstream pipeline, I'm attempting to change the names of fastq sequence ids using biopython. For example... going from headers that look like this: @D00602:32:H3LN7BCXX:1:1101:1205:2112…
Gunther
  • 129
  • 7
3
votes
0 answers

Error sh: 1: fastqc: not found while calling fastqc

I have checked many times that fastqc is installed in the bin folder, and library("fastqcr") does not give any error either, yet I still get the error sh: 1: fastqc: not found for the following command: fastqc(fq.dir = "~/WES_Pipeline/Data", #…
Lot_to_learn
  • 590
  • 2
  • 9
  • 21
3
votes
3 answers

How can I do a transparent gzip uncompress from both stdin and files in perl?

I've written a few scripts for processing FASTA/FASTQ files (e.g. fastx-length.pl), but would like to make them more generic and accept both compressed and uncompressed files as both command line parameters and as standard input (so that the scripts…
gringer
  • 410
  • 4
  • 13
3
votes
2 answers

Efficient way to TRANSLATE every Nth string in bash or R

Thank you for taking the time to look at this. I have a fastq file and I want to translate it to the complement, but not the reverse complement, something like this: @Some header example:1: ACTGAGACTCGATCA + S0m3_Qu4l1t13s& Translated…
Edahi
  • 59
  • 7