I have a big text file whose lines are in this format:
Query: 1586 cccaagatgagctgcagccccccagagagagctctgcacgtcaccaagtaaccaggcccc 1645
Sbjct: 27455708 cccaagatgagctgcagccccccagagagagctctgcacgtcaccaagtaaccaggcccc 27455649
Query: 1646 agcctccaggcccccaactccgcccagcctctccccgctctggatcctgcactctaacac 1705
Sbjct: 27455648 agcctccaggcccccaactccgcccagcctctccccgctctggatcctgcactctaacac 27455589
Query: 1706 tcgactctgctgctcatgggaagaacagaattgctcctgcatgcaactaattcaataaaa 1765
Sbjct: 27455588 tcgactctgctgctcatgggaagaacagaattgctcctgcatgcaactaattcaataaaa 27455529
For each line, I want to extract only the varying sequence of a, g, t and c characters and discard everything else (the Query:/Sbjct: labels and the varying numbers), so that the final strings look like this:
line1 = cccaagatgagctgcagccccccagagagagctctgcacgtcaccaagtaaccaggcccc
line2 = cccaagatgagctgcagccccccagagagagctctgcacgtcaccaagtaaccaggcccc
etc...
I've been working on this for a while and can't get it to work. I've tried the re module and str.translate,
but to no avail. I am programming in Python 3.4. Thank you!
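
For reference, here is a minimal sketch of the extraction described above. It assumes every relevant line has exactly four whitespace-separated fields (label, start position, sequence, end position), as in the sample. The names extract_sequence and extract_from_file are hypothetical, chosen only for illustration, and the code avoids anything newer than Python 3.4.

```python
def extract_sequence(line):
    """Return the sequence field of a 'Query:' or 'Sbjct:' line, else None.

    Assumption: the line layout is 'label start sequence end', separated
    by whitespace, as in the sample data.
    """
    parts = line.split()
    if len(parts) == 4 and parts[0] in ("Query:", "Sbjct:"):
        return parts[2]
    return None


def extract_from_file(path):
    """Collect the sequences from every matching line of the file at path."""
    sequences = []
    with open(path) as handle:
        for line in handle:
            seq = extract_sequence(line)
            if seq is not None:
                sequences.append(seq)
    return sequences
```

Calling extract_from_file on the input file would give a list with one sequence per Query/Sbjct line, in file order. Lines that do not match the assumed layout are skipped.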