
I have a list of genes (gene1, gene2, ...) that contains all my genes of interest. For each gene, I would like to extract only the free energy data so that I can process it separately.

My data set looks like this and contains information for more than 500 genes:

    ==> data/gene1_free_energy.dat <==
    0                0                0
    1                0                0
    2                0                2.3
    3                0                5.4
    .
    .
    .

    ==> data/gene1_rare_enrichment.dat <==
    7         0.166667         0.939498
    8         0.222222         0.930714
    9         0.0555556        0.998125
    10        0.166667         0.826133
    .
    .
    .

    ==> data/gene2_free_energy.dat <==
    0                0                0
    1                0                0
    2                0                2.3
    3                0                5.4
    .
    .
    .

    ==> data/gene2_rare_enrichment.dat <==
    7         0.166667         0.939498
    8         0.222222         0.930714
    9         0.0555556        0.998125
    10        0.166667         0.826133
    .
    .
    .

To extract the data between two delimiters, I found this answer very helpful: "Repeatedly extract a line between two delimiters in a text file, Python". However, I cannot figure out how to use the gene name as a variable.

    import re
    with open(input1) as fp:
        for result in re.findall('==> data/gene1_free_energy.dat <==(.*?)==>  data/gene1_rare_enrichment.dat <==', fp.read(), re.S):
            print(result)  # or save this in a dictionary or whatever

This nicely prints it for gene1.

I tried the following, but it does not work.

    import re
    for name in gene_list:  # this is my list of included genes
        with open(input1) as fp:
            for result in re.findall('==> data/' + name + '_free_energy.dat <==(.*?)==>  data/' + name + '_rare_enrichment.dat <==', fp.read(), re.S):
                print(result)

Is there a way to write such a loop? Or is there another more clever way to extract the data I need?

blubber
  • What precisely goes wrong when you say it does not work? I notice that `(.*?)==>  data/` contains two spaces where your sample file has only one. Maybe that's the issue? Also, calling `fp.read()` repeatedly may not work as intended: I'd read the contents of the file first, store them in a variable, and only then start the `for name` loop. – brm Feb 07 '18 at 19:00
  • You are totally right. It was just the two spaces that caused me to get no output. And changing the order so that `fp.read()` is called only once is definitely a very good idea. Thanks a lot! – blubber Feb 07 '18 at 21:50
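
The two suggestions from the comments (a single space in the delimiter, and reading the file only once) can be combined into one loop. The following is a minimal sketch, not the asker's final code: the function name `extract_free_energy` is hypothetical, and it assumes every free energy block is directly followed by the `rare_enrichment` header of the same gene, as in the sample data.

```python
import re

def extract_free_energy(text, gene_names):
    """Return {gene name: free energy block as text} for each gene found.

    `text` is the whole file content, read once by the caller.
    """
    blocks = {}
    for name in gene_names:
        # re.escape keeps regex metacharacters in a gene name literal.
        gene = re.escape(name)
        pattern = (
            rf"==> data/{gene}_free_energy\.dat <==\n"
            rf"(.*?)"
            rf"==> data/{gene}_rare_enrichment\.dat <=="
        )
        # re.S lets "." match newlines, so the group can span several lines.
        match = re.search(pattern, text, re.S)
        if match:
            blocks[name] = match.group(1).strip()
    return blocks
```

With the question's variables, this would be called as `extract_free_energy(open(input1).read(), gene_list)` (ideally inside a `with` block). Genes that do not appear in the file are simply missing from the returned dictionary.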

1 Answer

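    with open('data.txt') as f:
        RC = False  # True while we are inside a free_energy block
        D = []      # one list of rows per gene
        key = []    # gene names, in the order they appear
        d = []      # rows of the block currently being read
        for line in f:
            if 'free_energy' in line:
                RC = True
                # '==> data/gene1_free_energy.dat <==' -> 'gene1'
                key.append(line.split('/')[1].split('_')[0])
            if RC:
                # skip header lines and blank lines
                if '==>' not in line and line.strip():
                    d.append(line.split())
            if 'rare_enrichment' in line:
                # the next header closes the current free_energy block
                RC = False
                D.append(d)
                d = []

    data = dict(zip(key, D))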

Output:

    {'gene1': [['0', '0', '0'],
               ['1', '0', '0'],
               ['2', '0', '2.3'],
               ['3', '0', '5.4']],
     'gene2': [['0', '0', '0'],
               ['1', '0', '0'],
               ['2', '0', '2.3'],
               ['3', '0', '5.4']]}
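
If the values are needed as numbers rather than strings, the same line-by-line idea can be written around a header regex. This is only a sketch under stated assumptions, not the answer's method: the names `HEADER` and `parse_blocks` are hypothetical, and it assumes every header has the form `==> data/<gene>_<kind>.dat <==` with `<kind>` being either `free_energy` or `rare_enrichment`.

```python
import re
from collections import defaultdict

# Assumed header layout: '==> data/<gene>_<kind>.dat <=='
HEADER = re.compile(r"==> data/(.+)_(free_energy|rare_enrichment)\.dat <==")

def parse_blocks(lines):
    """Group numeric rows by gene name and file kind, using the header lines."""
    data = defaultdict(dict)
    rows = None  # list receiving rows of the current block, if any
    for line in lines:
        header = HEADER.match(line.strip())
        if header:
            gene, kind = header.groups()
            rows = data[gene].setdefault(kind, [])
            continue
        fields = line.split()
        if rows is None or not fields:
            continue  # text before the first header, or a blank line
        try:
            rows.append([float(x) for x in fields])
        except ValueError:
            pass  # skip non-numeric lines such as a lone "."
    return dict(data)
```

Calling `parse_blocks(open('data.txt'))` would then give access to one gene's data via `result['gene1']['free_energy']`. Because the gene name is captured greedily, names that themselves contain underscores are kept whole.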
Osman Mamun