complex regex matches in python

Question

I have a txt file that contains the following data:

chrI

ATGCCTTGGGCAACGGT...(multiple lines)

chrII

AGGTTGGCCAAGGTT...(multiple lines)

I want to first find 'chrI' and then iterate through the multiple lines of ATGC until I find the xth char. Then I want to print the xth char until the yth char. I have been using regex but once I have located the line containing chrI, I don't know how to continue iterating to find the xth char.

Here is my code:

for i, line in enumerate(sacc_gff):
    for match in re.finditer(chromo_val, line):
        print(line)
        for match in re.finditer(r"[ATGC]{%d},{%d}\Z" % (int(amino_start), int(amino_end)), line):
            print(match.group())

What the variables mean:

chromo_val = chrI

amino_start = (some start point my program found)

amino_end = (some end point my program found)

Note: amino_start and amino_end need to be in variable form.

Please let me know if I could clarify anything for you, Thank you.

Can you clarify what you mean by this part: *iterate through the multiple lines of ATGC until I find the xth char. Then I want to print the xth char until the yth char.* — Shashank, Apr 29 '15 at 03:48
Do your sequences follow the fasta format? If not can you can easily do this by adding a ">" at the beginning of each name. This would allow to use biopython, which would make this very easy. — Darwin, Apr 29 '15 at 04:09
samgak: lets assume the xth char is 2 and the yth char is 4, then using the following data: ATGCCCGT, I would have TGC printed to the console. — Medici, Apr 29 '15 at 04:27
Darwin: Yes my data is from a .gff file and ">" is before chrI and chrII — Medici, Apr 29 '15 at 04:29
Darwin: well I am using linux but I am writing the file with emacs — Medici, Apr 29 '15 at 04:34
I've edited my answer below to include a sample of fasta formatted data. It sounds like your .gff file should be a .fa file — afinit, Apr 29 '15 at 04:47

afinit · Accepted Answer · 2015-04-29T05:17:38.803

3

It looks like you are working with fasta data, so I will provide an answer with that in mind, but if it isn't you can use the sub_sequence selection part still.

fasta_data = {} # creates an empty dictionary
with open( fasta_file, 'r' ) as fh:
    for line in fh:
        if line[0] == '>':
            seq_id = line.rstrip()[1:] # strip newline character and remove leading '>' character
            fasta_data[seq_id] = ''
        else:
            fasta_data[seq_id] += line.rstrip()

# return substring from chromosome 'chrI' with a first character at amino_start up to but not including amino_end
sequence_string1 = fasta_data['chrI'][amino_start:amino_end]
# return substring from chromosome 'chrII' with a first character at amino_start up to and including amino_end
sequence_string2 = fasta_data['chrII'][amino_start:amino_end+1]

fasta format:

>chr1
ATTTATATATAT
ATGGCGCGATCG
>chr2
AATCGCTGCTGC

edited Apr 29 '15 at 05:17

answered Apr 29 '15 at 04:21

afinit

985
1
9
16

Thank you for your answer benie, I have never worked with 'with' so I am reading up on it right now. – Medici Apr 29 '15 at 04:35
benine, I am unable to see how you are able to iterate and find 'amino_start' till 'amino_end'? Also are you creating an empty list with 'fasta_data = {}'? – Medici Apr 29 '15 at 05:12
I added a couple of comments to the code. `fasta_data = {}` creates an empty dictionary to store the sequences in with their IDs as keys. and the `fasta_data['chrI'][amino_start:amino_end]` returns a substring from amino_start index to amino_end index (remember that this doesn't include the character at amino_end). If you need to iterate over the string, you can do so with `for c in sequence_string1:`. (for details on iterating over a string in python: http://stackoverflow.com/questions/538346/iterating-over-a-string) – afinit Apr 29 '15 at 05:23
This looks promising benine, I will try to incorporate it into my code and let you know how it went. Thank you for your help, I will keep you updated :) – Medici Apr 29 '15 at 05:44
@Medici I think I should have been more clear and mentioned that var[:] notation is called slicing notation in python. Further explanation can be found [here](http://stackoverflow.com/questions/509211/explain-pythons-slice-notation). This is about lists but applies strings as well – afinit Apr 29 '15 at 05:47
benine, I finally understood your code however when I incorporated it into my program I am getting an error. The error states: NameError: name 'seq_id' is not defined. However I am using your method to see if I can write something similar to this. – Medici Apr 29 '15 at 21:59
benine: I used something similar to your answer to get to my solution. Thank you. – Medici May 08 '15 at 18:27

score 0 · Answer 2 · answered Apr 29 '15 at 04:46

0

Since you are working with fasta files which are formatted like this:

>Chr1
ATCGACTACAAATTT
>Chr2
ACCTGCCGTAAAAATTTCC

and are a bioinformatics major I am guessing you will be manipulating sequences often I recommend install the perl package called FAST. Once this is installed to get the 2-14 character of every sequence you would do this:

fascut 2..14 fasta_file.fa

Here is the recent publication for FAST and github that contains a whole toolbox for manipulating molecule sequence data on the command line.

answered Apr 29 '15 at 04:46

Darwin

519
2
11

Darwin thank you for you answer but since this is a class problem I am only allowed to use python. However I will certainly take your input into consideration when I am data mining without such restrictions. :) – Medici Apr 29 '15 at 04:52
1

Use the biopython package then. – Darwin Apr 29 '15 at 05:05
I feel that my professor may frown upon that since it would take away from the "learning experience". Anyways I should probably e-mail him and ask for some guidance. Thank you again for your help. :) – Medici Apr 29 '15 at 05:14
So this may be a matter of preference, but as someone who works in bioinformatics, I think it can be bad practice to rely too much on external libraries. Especially for someone just learning to code and when the solution in native python is so simple. I'd also argue that the native solution is the more pythonic option. – afinit Apr 29 '15 at 05:15

complex regex matches in python

2 Answers2