0

I'd like to write a pattern for a Python script in which at least one of three words must occur within a stretch of a specified length.

for example, given a sequence:

ATG GTC TGA CGA CGG CAG TAA AAA AAA GGG TGG GCA GCC TTT GAA GCC TTT

I'd like to find all occurrences of 19-21mers that contain at least one of any of the following words: TAG, TGA, or TAA

I tried to specify a pattern = '[A,G,C,T,\s]{21,26}^.*\b(TAT|TGA|CCC)\b.*$'

But it doesn't seem to work and I'm sure I'm doing something that shows what a noob I am.

Pedro Lobito
  • To clarify, you're looking for the `TAG`, `TGA`, and `TAA` codons in any seven codon sequence? – timolawl May 15 '16 at 03:02
  • So to clarify, do overlaps count? So the sequence `ATG GTC TGA CGA CGG CAG TAA AAA AAA` should return matches of `ATG GTC TGA CGA CGG CAG TAA`, `GTC TGA CGA CGG CAG TAA AAA`, and `TGA CGA CGG CAG TAA AAA AAA`? – Anthony E May 15 '16 at 03:16
  • And why a 19 or 20-mer? Wouldn't any number of base pairs not divisible by 3 be an incomplete sequence? – Anthony E May 15 '16 at 03:19
  • It has to be within the codon, so overlaps don't count – user6306903 May 15 '16 at 03:24
  • Can you give an example? For the sequence `ATG GTC TGA CGA CGG CAG TAA AAA AAA` what should the matched result be? – Anthony E May 15 '16 at 03:25
  • incomplete sequences are OK, so they don't have to be divisible by 3... – user6306903 May 15 '16 at 03:25
  • The biopolymer phrasing is ambiguous. Are you referring to the DNA sequence? If so, then specifying a range between 19-21 doesn't make sense because you'll be messing with the reading frame. If you're referring to the translated polypeptide, that isn't the best way of phrasing it either because the 3 nucleotide reading frames are simply referred to as codons. – timolawl May 15 '16 at 03:25
  • @timolawl Exactly, there's more than one valid result for a given sequence. – Anthony E May 15 '16 at 03:26
  • Anthony E- an example would be CAG TAA AAA AAA GGG TGG where a TAA is present within the 18mer – user6306903 May 15 '16 at 03:27
  • Sorry, but you're not being clear: `CAG TAA AAA AAA GGG TGG` is an 18-mer, so it doesn't satisfy the 19-21mer requirement. – Anthony E May 15 '16 at 03:28
  • And by that logic `CGG CAG TAA AAA AAA GGG`, et al. should also be valid since it contains a `TAA` as well. – Anthony E May 15 '16 at 03:29
  • timolawl- it's OK for me to mess with the ORF. I am indeed referring to the DNA... just trying to pick out patterns within a range of an ORF. – user6306903 May 15 '16 at 03:29
  • Anthony E: yes, thanks for the correction about the length... at this point I am actually fluid about the length (it's not the heart of my problem). And you are right that there are other examples within that pattern. – user6306903 May 15 '16 at 03:31
  • OK got it, so an open reading frame of 7 can be used to satisfy the length. – Anthony E May 15 '16 at 03:32
  • yes! I just need to figure out how to specify the requirements for the stop codons... – user6306903 May 15 '16 at 03:33
  • Might be better to remove spaces in the string before getting a regular expression to parse the string then since reading frames isn't important. – timolawl May 15 '16 at 03:35
  • Reading frame IS important. I will be searching for patterns within an ORF. The specific length isn't the key here. – user6306903 May 15 '16 at 03:40
  • Am I misunderstanding "its OK for me to mess with the ORF"? Are you testing all 3 possible reading frames for the stop codons? I'll simply assume 7 codons for the length. – timolawl May 15 '16 at 03:43
  • I don't want to find stop codons if they aren't in frame with the formatted codons above... – user6306903 May 15 '16 at 03:54
  • @timolawl This would only return one match for the first 7 codons in the string even if the matching codon wasn't in the first 7 codons. Pretty sure a simple regex can't be used for this problem because of this reason. You would have to define a separate regex for each of the 7 different positions the matching codon could be in. – Anthony E May 15 '16 at 04:17
  • @AnthonyE Ah you're right. I had a similar reply to a question that popped up a few days ago. Not sure why I made the same mistake. The conclusion was also that regex alone wasn't enough to solve that question. Then again, python's regex is more expressive than that of JS. Good catch. – timolawl May 15 '16 at 04:23

2 Answers

1

I don't think regex is expressive enough to handle this with the length requirement.

However, you can break this problem down by using a window iterator to simulate an open reading frame:

# From http://stackoverflow.com/questions/6822725/rolling-or-sliding-window-iterator-in-python:

from itertools import islice

def window(seq, n=2):
    """Returns a sliding window (of width n) over data from the iterable.

    s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...
    """
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

sequence = "ATG GTC TGA CGA CGG CAG TAA AAA AAA GGG TGG GCA GCC TTT GAA GCC TTT"
codons = sequence.split()

orf = window(codons, 7)
matching_codons = ['TGA', 'TAA', 'TAG']

[sequence for sequence in orf if any(codon in matching_codons for codon in sequence)]

Dissecting the code

orf = window(codons, 7)

This defines a generator which will return each frame of length 7, moving the frame by 1 each iteration.

Then, the list comprehension does two things.

  1. It iterates over each sequence in our ORF:

    [sequence for sequence in orf] # returns all possible frames of length 7 in sequence

  2. It filters the result, only returning sequences that contain any of the valid codons:

    [sequence for sequence in orf if any(codon in ['TGA', 'TAA', 'TAG'] for codon in sequence)] # keeps only frames that contain 'TGA', 'TAA', or 'TAG'
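One caveat with the two steps above: `orf` is a generator, so it can be consumed only once, and a second comprehension over it would come back empty. A minimal sketch, assuming you want to run both steps on the same frames, is to materialize them into a list first. The names `frames`, `all_frames`, `hits`, and `stop_codons` are introduced here for illustration, and the slicing is a stand-in for `window(codons, 7)`:

```python
sequence = "ATG GTC TGA CGA CGG CAG TAA AAA AAA GGG TGG GCA GCC TTT GAA GCC TTT"
codons = sequence.split()

# Stand-in for window(codons, 7): every run of 7 consecutive codons.
n = 7
frames = [tuple(codons[i:i + n]) for i in range(len(codons) - n + 1)]

stop_codons = {'TGA', 'TAA', 'TAG'}

# Step 1: all frames (a list can be iterated repeatedly, unlike a generator).
all_frames = [frame for frame in frames]

# Step 2: only the frames that contain at least one of the stop codons.
hits = [frame for frame in frames if any(codon in stop_codons for codon in frame)]

print(len(all_frames), len(hits))  # 11 7
```

Using a set for the codons of interest makes the membership test a little cheaper than a list, though for three items the difference is negligible.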

Finally, if you want the result to be the substrings themselves, use the following list comprehension:

[' '.join(sequence) for sequence in window(codons, 7) if any(codon in ['TGA', 'TAA', 'TAG'] for codon in sequence)]

Result:

['ATG GTC TGA CGA CGG CAG TAA', 'GTC TGA CGA CGG CAG TAA AAA', 'TGA CGA CGG CAG TAA AAA AAA', 'CGA CGG CAG TAA AAA AAA GGG', 'CGG CAG TAA AAA AAA GGG TGG', 'CAG TAA AAA AAA GGG TGG GCA', 'TAA AAA AAA GGG TGG GCA GCC']
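If the position of each hit matters as well, the same idea can be wrapped in a small function. This is only a sketch under the assumption that the input is already split into in-frame, whitespace-separated codons; `find_frames` and its parameters `width` and `targets` are hypothetical names, not part of the code above:

```python
def find_frames(sequence, width=7, targets=('TGA', 'TAA', 'TAG')):
    """Yield (start_codon_index, frame_string) for each window of `width`
    codons that contains at least one codon from `targets`."""
    codons = sequence.split()
    targets = set(targets)
    for start in range(len(codons) - width + 1):
        frame = codons[start:start + width]
        # A non-empty intersection means the frame holds a target codon.
        if targets.intersection(frame):
            yield start, ' '.join(frame)


sequence = "ATG GTC TGA CGA CGG CAG TAA AAA AAA GGG TGG GCA GCC TTT GAA GCC TTT"
for start, frame in find_frames(sequence):
    print(start, frame)
```

Because whole codons are compared, a stop codon that only appears out of frame (spanning two codons) is never reported.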
Anthony E
0
import re
string_to_read="ATG GTC TGA CGA CGG CAG TAA AAA AAA GGG TGG GCA GCC TTT GAA GCC TTT"
res=re.search('(TAT|TGA|TAA)', string_to_read)
if res:
    print('matched %s'%res.groups())

This regex will tell you if any of those 3 sequences exist in the string you are testing against.

If you need to check that all 3 exist, you could test for each one independently:

if re.search('TAT', string_to_read) and re.search('TGA', string_to_read) and re.search('TAA', string_to_read):
    print('has all 3')

instead of some clever regex with all the combinatorics of those 3 sequences. If you don't wanna run 3 separate regexes, you could use a regex like (TAT)|(TGA)|(TAA), keep a tally of which groups matched, and see if you've hit all of them.
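A rough sketch of that tally idea, assuming a plain alternation with `re.findall` is acceptable in place of separate capture groups (`wanted` and `found` are names made up for this example):

```python
import re

string_to_read = "ATG GTC TGA CGA CGG CAG TAA AAA AAA GGG TGG GCA GCC TTT GAA GCC TTT"
wanted = {'TAT', 'TGA', 'TAA'}

# re.findall returns every non-overlapping match of the alternation;
# collecting them in a set gives the tally of which sequences appeared.
found = set(re.findall(r'TAT|TGA|TAA', string_to_read))

if found == wanted:
    print('has all 3')
else:
    print('missing: %s' % ', '.join(sorted(wanted - found)))
```

For this particular string only TGA and TAA occur, so the script reports TAT as missing.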

djcrabhat
  • This doesn't work with the length requirement though because it would match `TAT TGA TAA` which isn't a 21-mer. – Anthony E May 15 '16 at 03:18
  • Oh, I guess I didn't understand that part :) Thought it was just a simple string match :) – djcrabhat May 15 '16 at 03:19
  • Yeah, it complicates things significantly. I'm not sure regexes are expressive enough for the length requirement. – Anthony E May 15 '16 at 03:21
  • I suppose you could just feed long-enough strings in to that regex – djcrabhat May 15 '16 at 03:31
  • hmmm... not sure I understand... what do you mean 'feed long-enough strings' – user6306903 May 15 '16 at 03:37
  • I mean I'm pretty sure I'm not really understanding the question here, but if I had some list of 50 21-base sequences, I'd just iterate over that list and check if they contain each of those sequences. Wouldn't even need a regex, just `'AAA' in my_string` – djcrabhat May 15 '16 at 04:31