I have a list of genes (gene1, gene2, ...) that contains all my genes of interest. For each gene, I would now like to extract only the free energy data so that I can process it separately.
My data set looks like this and contains information for more than 500 genes:
==> data/gene1_free_energy.dat <==
0 0 0
1 0 0
2 0 2.3
3 0 5.4
.
.
.
==> data/gene1_rare_enrichment.dat <==
7 0.166667 0.939498
8 0.222222 0.930714
9 0.0555556 0.998125
10 0.166667 0.826133
.
.
.
==> data/gene2_free_energy.dat <==
0 0 0
1 0 0
2 0 2.3
3 0 5.4
.
.
.
==> data/gene2_rare_enrichment.dat <==
7 0.166667 0.939498
8 0.222222 0.930714
9 0.0555556 0.998125
10 0.166667 0.826133
.
.
.
To extract the data between two delimiters, I found this answer very helpful: Repeatedly extract a line between two delimiters in a text file, Python. However, I cannot figure out how to use the gene name as a variable.
import re
with open(input1) as fp:
    for result in re.findall('==> data/gene1_free_energy.dat <==(.*?)==> data/gene1_rare_enrichment.dat <==', fp.read(), re.S):
        print(result)  # or save this in a dictionary or whatever
This nicely prints it for gene1.
I tried the following, but it does not work.
import re
for name in gene_list:  # this is my list of included genes
    with open(input1) as fp:
        for result in re.findall('==> data/' + name + '_free_energy.dat <==(.*?)==> data/' + name + '_rare_enrichment.dat <==', fp.read(), re.S):
            print(result)
Is there a way to write such a loop, or is there a cleverer way to extract the data I need?
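One possible alternative to a regex per gene is to read the file once, treat every "==> data/..._....dat <==" line as a section header, and collect the lines below it into a dictionary. The following is only a minimal sketch under some assumptions: the header format is taken from the sample above, the names HEADER, split_sections and free_energy_for are hypothetical helpers made up for illustration, and the lines containing a single "." are assumed to be elisions in the sample rather than real file content.

```python
import re

# Header format assumed from the sample, e.g. "==> data/gene1_free_energy.dat <==".
HEADER = re.compile(
    r"^==> data/(?P<gene>.+)_(?P<kind>free_energy|rare_enrichment)\.dat <==$"
)


def split_sections(lines):
    """Return {(gene, kind): [data lines]} from an iterable of text lines."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        match = HEADER.match(line)
        if match:
            # A new header starts a new section.
            current = (match.group("gene"), match.group("kind"))
            sections[current] = []
        elif current is not None and line:
            # Non-empty data line belonging to the most recent header.
            sections[current].append(line)
    return sections


def free_energy_for(sections, gene_list):
    """Keep only the free-energy block of each gene in gene_list."""
    return {
        name: sections[(name, "free_energy")]
        for name in gene_list
        if (name, "free_energy") in sections
    }


# Small in-memory sample in the same layout as the data shown above.
sample = """==> data/gene1_free_energy.dat <==
0 0 0
1 0 0
==> data/gene1_rare_enrichment.dat <==
7 0.166667 0.939498
==> data/gene2_free_energy.dat <==
2 0 2.3
3 0 5.4
==> data/gene2_rare_enrichment.dat <==
8 0.222222 0.930714
"""

sections = split_sections(sample.splitlines())
result = free_energy_for(sections, ["gene1", "gene2"])
for name, rows in result.items():
    print(name, rows)
```

With the real file, the open file handle could be passed directly, for example split_sections(fp) inside "with open(input1) as fp:". This reads the file only once instead of once per gene, and it does not depend on the rare_enrichment block always following the free_energy block.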