I have a big text file whose lines are in this format:

Query: 1586     cccaagatgagctgcagccccccagagagagctctgcacgtcaccaagtaaccaggcccc 1645
Sbjct: 27455708 cccaagatgagctgcagccccccagagagagctctgcacgtcaccaagtaaccaggcccc 27455649

Query: 1646     agcctccaggcccccaactccgcccagcctctccccgctctggatcctgcactctaacac 1705      
Sbjct: 27455648 agcctccaggcccccaactccgcccagcctctccccgctctggatcctgcactctaacac 27455589

Query: 1706     tcgactctgctgctcatgggaagaacagaattgctcctgcatgcaactaattcaataaaa 1765              
Sbjct: 27455588 tcgactctgctgctcatgggaagaacagaattgctcctgcatgcaactaattcaataaaa 27455529

For each line, I want to extract only the varying sequence of a, g, t and c characters and remove everything else (the Query and Sbjct labels and the varying numbers), so that the final strings look like this:

line1 = cccaagatgagctgcagccccccagagagagctctgcacgtcaccaagtaaccaggcccc
line2 = cccaagatgagctgcagccccccagagagagctctgcacgtcaccaagtaaccaggcccc
etc...

I've been working on this for a while and can't get it to work. I've tried the re module and .translate, but got no results. I am programming in Python 3.4. Thank you!

Brad Larson
Peter

1 Answer

While you could use regular expressions (as you have attempted), the example you provide can be split easily with `agtc_part = line.split()[2]`.

This splits a given line into a list of strings, using whitespace as the delimiter. Indexing starts from 0, so the part containing the agtc sequence is at index 2.

Note that calling `split()` without an argument does more than split on a single space character: it splits on any whitespace and treats a run of consecutive whitespace characters as one separator rather than splitting on each one. This matters in your case because the number of whitespace characters between the number and the agtc string varies.

Example:

>>> "aaa   bbb".split()
['aaa', 'bbb']
>>> "aaa   bbb".split(' ')
['aaa', '', '', 'bbb']
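
Applied to the lines in the question, this could look like the sketch below. The function name `extract_sequences` and the check on the `Query:`/`Sbjct:` labels are my own assumptions for illustration; the check is only there to skip the blank lines between blocks.

```python
def extract_sequences(lines):
    """Return the sequence column of every Query:/Sbjct: line (illustrative helper)."""
    sequences = []
    for line in lines:
        # No argument: any run of whitespace counts as a single separator.
        parts = line.split()
        # An alignment row has four fields: label, start, sequence, end.
        if len(parts) == 4 and parts[0] in ("Query:", "Sbjct:"):
            sequences.append(parts[2])
    return sequences


sample = [
    "Query: 1586     cccaagatgagctgcagccccccagagagagctctgcacgtcaccaagtaaccaggcccc 1645",
    "Sbjct: 27455708 cccaagatgagctgcagccccccagagagagctctgcacgtcaccaagtaaccaggcccc 27455649",
    "",
    "Query: 1646     agcctccaggcccccaactccgcccagcctctccccgctctggatcctgcactctaacac 1705      ",
]

for number, sequence in enumerate(extract_sequences(sample), start=1):
    print("line{} = {}".format(number, sequence))
```

Because an open file yields its lines when iterated, the same function should also accept a file object directly, for example inside a `with open(...) as f:` block.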
three_pineapples
  • Just a minor point, the default delimiter is a space, so you can simply `line.split()[2]` – Burhan Khalid Oct 12 '14 at 04:30
  • Yep, I was aware of that, but thought it might be better to make it explicit :) – three_pineapples Oct 12 '14 at 05:02
  • @three_pineapples: Your solution gives a wrong result for the line starting with "Query". With the usage of the explicit space (`' '`) you ran into a trap. From the documentation: "[If sep is given, consecutive delimiters are not grouped together](https://docs.python.org/3/library/stdtypes.html#str.split)". – Matthias Oct 12 '14 at 05:50
  • @Matthias Oh! When I wrote the answer, the example provided was not in a code block in the post so there was only 1 space visible between the number and the text. I thus thought everything was separated by a single space, not padded to look nice when printed. I'll update my answer. – three_pineapples Oct 12 '14 at 06:09
  • @three_pineapples: Looks like I have to give you an upvote now. :-) – Matthias Oct 12 '14 at 18:01