I ran P-Match (http://www.gene-regulation.com/cgi-bin/pub/programs/pmatch/bin/p-match.cgi) and need to process its result to obtain only the sequence ID, start position, and end position. How can I extract this coordinate information from the result? An example result is shown below.

Scanning sequence ID:   BEST1_HUMAN

              150 (-)  1.000  0.997  GGAAAggccc                                   R05891
              354 (+)  0.988  0.981  gtgtAGACAtt                                  R06227
V$CREL_01c-RelV$EVI1_05Evi-1

Scanning sequence ID:   4F2_HUMAN

              365 (+)  1.000  1.000  gggacCTACA                                   R05884
               789 (-)  1.000  1.000  gcgCGAAA                                       R05828; R05834; R05835; R05838; R05839
V$CREL_01c-RelV$E2F_02E2F

Expected output:

Sequence ID start end
(The end position is the start position plus the length of the matched short sequence, e.g. GGAAAggccc.)

BEST1_HUMAN 150 160
BEST1_HUMAN 354 365
4F2_HUMAN   365 375
4F2_HUMAN   789 797

Can anyone help me?

Yu Hao
Xiong89

1 Answer

Use the snippet from this answer to split your result into evenly sized chunks and extract your desired data:

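def chunks(lines, n):
    # Generator that yields successive n-sized chunks from lines
    for i in range(0, len(lines), n):
        yield lines[i:i + n]

with open('p_match.txt') as f:
    # Each record is assumed to span exactly 6 lines:
    # header, blank, two hit lines, matrix-name line, blank
    for chunk in chunks(f.readlines(), 6):
        sequence_id = chunk[0].split()[-1]
        for i in (2, 3):
            fields = chunk[i].split()
            start = int(fields[0])
            # The matched sequence is the fifth field; counting from the end
            # breaks when a line lists several identifiers
            sequence = fields[4]
            stop = start + len(sequence)
            print(sequence_id, start, stop)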

Edit: Apparently the result can contain a variable number of hits per sequence, so the above solution of splitting into evenly sized chunks doesn't work. You could then go the regex route, or split the text on the header and go through each record line by line:

with open('p_match.txt') as f:
    text = f.read()

# Split the output into one chunk per scanned sequence
for chunk in text.split('Scanning sequence ID:'):
    if not chunk.strip():
        continue
    lines = chunk.split('\n')
    sequence_id = lines[0].strip()
    for line in lines[1:]:
        fields = line.split()
        # Hit lines start with the numeric position;
        # the matched sequence is the fifth field
        if len(fields) >= 5 and fields[0].isdigit():
            start = int(fields[0])
            stop = start + len(fields[4])
            print(sequence_id, start, stop)
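
The regex route is not spelled out above, so here is a minimal sketch of it. It assumes every hit line has the layout shown in the question: position, strand in parentheses, two scores, then the matched sequence made up of letters only. The names HEADER_RE, HIT_RE, extract_coordinates and the inline SAMPLE are purely illustrative and not part of P-Match. Capturing the sequence by its position in the line also copes with lines that list several semicolon-separated identifiers.

```python
import re

# Assumed layout of a header line and of a hit line (taken from the example
# in the question, not from any P-Match specification).
HEADER_RE = re.compile(r'^Scanning sequence ID:\s*(\S+)')
HIT_RE = re.compile(r'^\s*(\d+)\s+\([+-]\)\s+[\d.]+\s+[\d.]+\s+([A-Za-z]+)\b')

SAMPLE = """Scanning sequence ID:   BEST1_HUMAN

              150 (-)  1.000  0.997  GGAAAggccc                                   R05891
              354 (+)  0.988  0.981  gtgtAGACAtt                                  R06227
V$CREL_01c-RelV$EVI1_05Evi-1

Scanning sequence ID:   4F2_HUMAN

              365 (+)  1.000  1.000  gggacCTACA                                   R05884
               789 (-)  1.000  1.000  gcgCGAAA                                       R05828; R05834; R05835; R05838; R05839
V$CREL_01c-RelV$E2F_02E2F
"""


def extract_coordinates(lines):
    """Yield (sequence_id, start, end) for every hit line found in lines."""
    sequence_id = None
    for line in lines:
        header = HEADER_RE.match(line)
        if header:
            # Remember which sequence the following hit lines belong to
            sequence_id = header.group(1)
            continue
        hit = HIT_RE.match(line)
        if hit and sequence_id is not None:
            start = int(hit.group(1))
            # End position = start position + length of the matched sequence
            yield sequence_id, start, start + len(hit.group(2))


for sequence_id, start, end in extract_coordinates(SAMPLE.splitlines()):
    print(sequence_id, start, end)
```

To process a real result file instead of the sample, the open file object can be passed in directly, since the function only iterates over lines.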
BioGeek
  • May I know how I can modify the code when there is a different number of start and end sites? Especially the line for i in (2,3). Thanks. – Xiong89 Jun 16 '15 at 08:34
  • It solved the problem perfectly. Thank you so much. I think the last 3 lines of code are not necessary, as they double the output. – Xiong89 Jun 16 '15 at 09:32
  • I realized that the length counted for the short sequence seems to be missing some base pairs. – Xiong89 Jun 16 '15 at 09:48