Questions tagged [dna-sequence]

A string representing the nucleotide sequence of deoxyribonucleic acid (DNA), the molecule that carries an organism's genes.

Deoxyribonucleic acid (DNA) contains the genetic instructions specifying the biological development of all cellular life. DNA consists of two long polymers of simple units called nucleotides.

Single-strand DNA sequences are commonly represented as a string of uppercase letters that correspond to the nucleotide units in the sequence (A, G, C, T). Less commonly, ambiguity codes are also used to specify that several alternative nucleotides are possible at a given position (R: A or G; Y: C or T; see the complete table).
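As a rough illustration of how ambiguity codes work, the sketch below expands an ambiguous sequence into every concrete A/C/G/T sequence it could stand for. The `IUPAC` mapping follows the standard IUPAC nucleotide codes; the function name `expand_ambiguous` is hypothetical and not taken from any library.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def expand_ambiguous(seq):
    """Return every concrete A/C/G/T sequence matching an ambiguous one.

    Hypothetical helper for illustration; raises KeyError on unknown symbols.
    """
    choices = [IUPAC[base] for base in seq.upper()]
    return ["".join(combo) for combo in product(*choices)]

print(expand_ambiguous("ARY"))  # ['AAC', 'AAT', 'AGC', 'AGT']
```

The number of expansions is the product of the option counts at each position, so this approach is only practical for sequences with a handful of ambiguous positions.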

A great deal of work in bioinformatics is devoted to the analysis and comparison of these strings. Individual DNA sequences may be very long, and sets of them may grow very large (gigabytes).
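Because the plain alphabet has only four letters, one common way to reduce the size of such data is to store each base in 2 bits instead of one byte per character. The following is a minimal sketch under the assumption that the sequence contains only A, C, G and T (no ambiguity codes); the names `pack` and `unpack` are hypothetical.

```python
# 2-bit code for each base; assumes no ambiguity codes are present.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)

def unpack(data, length):
    """Recover the original string; length is needed because the
    last byte may be only partly filled."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )

packed = pack("ATGCCGCTGCGC")
print(len(packed))         # 3 bytes for 12 bases
print(unpack(packed, 12))  # ATGCCGCTGCGC
```

At 2 bits per base, four bases fit in one byte, so the packed form takes a quarter of the space of one-character-per-base text, at the cost of not being able to represent ambiguity codes.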

475 questions
108
votes
11 answers

How much storage would be required to store a human genome?

I'm looking for the amount of storage in bytes (MB, GB, TB, etc.) required to store a single human genome. I read a few articles on Wikipedia about DNA, chromosomes, base pairs, genes, and have some rough guess, but before disclosing anything I'd…
Milan Babuškov
  • 59,775
  • 49
  • 126
  • 179
37
votes
6 answers

How to plot a gene graph for a DNA sequence say ATGCCGCTGCGC?

I need to generate a random walk based on the DNA sequence of a virus, given its base pair sequence of 2k base pairs. The sequence looks like "ATGCGTCGTAACGT". The path should turn right for an A, left for a T, go upwards for a G and downwards for a…
25
votes
14 answers

Search for string allowing for one mismatch in any location of the string

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasite). I am not sure how large the genome is, but much longer than 230,000…
Vincent
  • 1,579
  • 4
  • 23
  • 38
16
votes
3 answers

how to match dna sequence pattern

I am having trouble finding an approach to solve this problem. Input-output sequences are as follows: input1: aaagctgctagag output1: a3gct2ag2 input2: aaaaaaagctaagctaag output2: a6agcta2ag Input sequence can be of…
user2442890
  • 161
  • 1
  • 3
15
votes
4 answers

Fast algorithms for finding unique sets in two very long sequences of text

I need to compare the DNA sequences of X and Y chromosomes, and find patterns (composed of around 50-75 base pairs) that are unique to the Y chromosome. Note that these sequence parts can repeat in the chromosome. This needs to be done quickly…
person
  • 205
  • 2
  • 6
14
votes
6 answers

Overlapping matches in R

I have searched and was able to find this forum discussion for achieving the effect of overlapping matches. I also found the following SO question speaking of finding indexes to perform this task, but was not able to find anything concise about…
hwnd
  • 69,796
  • 4
  • 95
  • 132
13
votes
10 answers

Reverse complement of DNA strand using Python

I have a DNA sequence and would like to get reverse complement of it using Python. It is in one of the columns of a CSV file and I'd like to write the reverse complement to another column in the same file. The tricky part is, there are a few cells…
user3783999
  • 571
  • 2
  • 7
  • 17
8
votes
4 answers

Improving code design of DNA alignment degapping

This is a question regarding a more efficient code design: Assume three aligned DNA sequences (seq1, seq2 and seq3; they are each strings) that represent two genes (gene1 and gene2). Start and stop positions of these genes are known relative to the…
Michael Gruenstaeudl
  • 1,609
  • 1
  • 17
  • 31
8
votes
3 answers

Finding matching strings when comparing two lists

I am trying to compare two lists to see if there are any matching strings within the lists. Let's say list_1 is ['GAAGGTCGAA', 'GAAGGTCGA', 'AAGGTCGAA', 'GAAGGTCG', 'AAGGTCGA', 'AGGTCGAA', 'GAAGGTC', 'AAGGTCG', 'AGGTCGA', 'GGTCGAA', 'GAAGGT',…
Rsherrill
  • 129
  • 3
  • 4
  • 12
7
votes
1 answer

chaos game for DNA sequences

I have tried the mathematica code for making the chaos game for DNA sequences posted in this address: http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html which is like this: genome = Import["c:\data\sequence.fasta", "Sequence"]; genome =…
Layla
  • 5,234
  • 15
  • 51
  • 66
7
votes
2 answers

Translation DNA to Protein

I am a biology graduate student and I taught myself a very limited amount of python in the past few months to deal with some data I have. I am not asking for homework help, this is for a research project. With this code I intend to take a portion of…
6
votes
5 answers

Generating Synthetic DNA Sequence with Substitution Rate

Given these inputs: my $init_seq = "AAAAAAAAAA" #length 10 bp my $sub_rate = 0.003; my $nof_tags = 1000; my @dna = qw( A C G T ); I want to generate: One thousand length-10 tags Substitution rate for each position in a tag is 0.003 Yielding…
neversaint
  • 60,904
  • 137
  • 310
  • 477
6
votes
4 answers

how to extend ambiguous dna sequence

Let's say you have a DNA sequence like this : AATCRVTAA where R and V are ambiguous values of DNA nucleotides, where R represents either A or G and V represents A, C or G. Is there a Biopython method to generate all the different combinations of…
jrjc
  • 21,103
  • 9
  • 64
  • 78
6
votes
5 answers

matching and counting strings (k-mer of DNA) in R

I have a list of strings (DNA sequence) including A,T,C,G. I want to find all matches and insert into table whose columns are all possible combination of those DNA alphabet (4^k; "k" is length of each match - K-mer - and must be specified by user)…
Cina
  • 9,759
  • 4
  • 20
  • 36
6
votes
5 answers

Generating random sequences of DNA

I am trying to generate random sequences of DNA in python using random numbers and random strings. But I am getting only one string as my output. For example: If I give DNA of length 5 (String(5)), I should get an output "CTGAT". Similarly if I give…
Rachel
  • 383
  • 2
  • 4
  • 13