I am working with a very large zipped text file "values.gz", containing multiple values per gene with geneID in the 2nd column, and another small text file containing a list of unique genes in the fourth column ("bed.txt"). I am trying to extract information from the values.gz file (without reading it all but using the dictionary from bed.txt) for a subset of genes listed in the bed.txt file, and save it in a separate file per gene. I cannot do this in bash as the file is too large to unzip, so I am trying in python but I am a newbie, and I am stuck: I have got as far as extracting the matched genes and saving them in a file, but how do I save to a different file per gene?
For example, the bed.txt file looks like this:
chr start end gene_id
1 11868 11869 ENSG00000223972
1 29552 29553 ENSG00000227232
1 29806 29807 ENSG00000243485
The values.gz file looks like this:
SNP gene v1 v2 v3
rs1 ENSG00000223972 0.09 0.8 0.5
rs1 ENSG00000227232 -0.06 -0.5 0.6
rs1 ENSG00000237683 -0.05 -0.5 0.5
rs2 ENSG00000223972 0.09 0.8 0.3
rs3 ENSG00000227232 0.09 0.8 0.3
The output I am looking for is: ENSG00000223972.gz
rs1 ENSG00000223972 0.09 0.8 0.5
rs2 ENSG00000223972 0.09 0.8 0.3
ENSG00000227232.gz
rs1 ENSG00000227232 -0.06 -0.5 0.6
rs3 ENSG00000227232 0.09 0.8 0.3
(All genes in values.gz should have a matching value in bed (but not 100% sure!), but the bed file will have more genes listed that do not match the values.gz file)
#! /usr/bin/env python
import gzip
lookup = dict()
my_file = open("bed.txt","r")
for line in my_file.readlines():
row = line.split()
lookup[row[3]] = row[1:]
# print lookup
my_file.close()
with open('MyOutFile', 'w') as outfile:
with gzip.open("values.gz", "r") as eqtl:
for line in eqtl.readlines():
for key in lookup:
if line.find(key) > -1:
outfile.write(line)
After Paisanco's suggestions:
with gzip.open("values.gz", "r") as eqtl:
for line in eqtl.readlines():
row = line.split()
gene = row[1]
filename = gene+'.txt'
if gene in lookup:
# assign unique file name here, then
if os.path.exists(filename):
append_write = 'a'
else:
append_write = 'w'
with open(filename,append_write) as outfile:
outfile.write(line)