
I am working with a very large gzipped text file, "values.gz", containing multiple values per gene with the gene ID in the 2nd column, and another small text file ("bed.txt") containing a list of unique genes in its fourth column. I am trying to extract information from values.gz for the subset of genes listed in bed.txt (streaming it rather than unzipping the whole thing, using a dictionary built from bed.txt), and save it in a separate file per gene. I cannot do this in bash as the file is too large to unzip, so I am trying Python, but I am a newbie and I am stuck: I have got as far as extracting the matched genes and saving them to a single file, but how do I save to a different file per gene?

For example, the bed.txt file looks like this:

chr    start   end     gene_id
1       11868   11869   ENSG00000223972
1       29552   29553   ENSG00000227232
1       29806   29807   ENSG00000243485

The values.gz file looks like this:

SNP     gene    v1    v2  v3
rs1       ENSG00000223972       0.09      0.8       0.5
rs1       ENSG00000227232       -0.06     -0.5      0.6
rs1       ENSG00000237683       -0.05     -0.5      0.5
rs2       ENSG00000223972       0.09      0.8       0.3
rs3       ENSG00000227232       0.09      0.8      0.3

The output I am looking for is: ENSG00000223972.gz

rs1       ENSG00000223972       0.09      0.8       0.5
rs2       ENSG00000223972       0.09      0.8      0.3

ENSG00000227232.gz

rs1       ENSG00000227232       -0.06     -0.5      0.6
rs3       ENSG00000227232       0.09      0.8       0.3

(All genes in values.gz should have a matching value in bed (but not 100% sure!), but the bed file will have more genes listed that do not match the values.gz file)

#!/usr/bin/env python3
import gzip

# Build a dictionary keyed by gene ID (4th column of bed.txt).
lookup = {}
with open("bed.txt") as my_file:
    for line in my_file:
        row = line.split()
        lookup[row[3]] = row[1:]

with open('MyOutFile', 'w') as outfile:
    with gzip.open("values.gz", "rt") as eqtl:  # "rt" = text mode in Python 3
        for line in eqtl:                       # iterate lazily, don't readlines()
            for key in lookup:
                if key in line:
                    outfile.write(line)

After paisanco's suggestions:

import os

with gzip.open("values.gz", "rt") as eqtl:
    for line in eqtl:
        row = line.split()
        gene = row[1]
        filename = gene + '.txt'
        if gene in lookup:
            # Append if the per-gene file already exists, otherwise create it.
            if os.path.exists(filename):
                append_write = 'a'
            else:
                append_write = 'w'
            with open(filename, append_write) as outfile:
                outfile.write(line)
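Putting the pieces together, here is a minimal end-to-end sketch of the same idea as a reusable function. The function name `split_by_gene`, the `out_dir` parameter, the header-skipping, and the choice of gzipped per-gene output files named `<gene>.gz` are my assumptions based on the samples above, not something fixed by the question:

```python
import gzip
import os

def split_by_gene(bed_path, values_path, out_dir="."):
    # Collect the wanted gene IDs from the 4th column of the bed file.
    with open(bed_path) as bed:
        next(bed)                          # skip the header line
        wanted = {line.split()[3] for line in bed}

    # Stream the gzipped values file once, appending matches per gene.
    # "rt"/"at" are gzip text modes; append mode creates the file if needed.
    with gzip.open(values_path, "rt") as eqtl:
        next(eqtl)                         # skip the header line
        for line in eqtl:
            gene = line.split()[1]
            if gene in wanted:
                out_path = os.path.join(out_dir, gene + ".gz")
                with gzip.open(out_path, "at") as out:
                    out.write(line)
```

Genes present in bed.txt but absent from values.gz simply never produce an output file, which matches the expectation that the bed file lists more genes than the values file contains.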
user971102
  • Are the gene IDs in the first file, fourth column, the same as the gene IDs in the second (larger) file, second column? For each line of your larger file, your current code seems to be iterating through the entire dictionary - is that necessary, or could you extract the gene ID from the line and test 'if geneID in lookup'? – paisanco Jun 08 '17 at 22:21
  • Hi paisanco, yes they are, although I have headers that are called in different ways, first file fourth column: "geneID", second file second column: "gene". Would what you suggest be more efficient? I didn't try to do that as the bed.txt file is quite small (well, 15,000 lines), but maybe it's worth it (?) – user971102 Jun 08 '17 at 22:29
  • Can you supply a few sample lines of input from each file? This will help people focus. Also, what sort of naming do you envision for the individual output files? – Prune Jun 08 '17 at 22:35
  • Sure! I updated the question... – user971102 Jun 08 '17 at 22:58

1 Answer

There are two things you can do here.

One, it looks like you are storing the gene ID as the key of your lookup table built from the first file. If that gene ID has the same type and value in the second file as in the first, you can search your lookup table more efficiently like this:

Code snippet:

for line in eqtl:
    row = line.split()
    if row[1] in lookup:
        # do something...

Two, if you wanted a unique name for each gene, your file open and write should be the inner loop, instead of the outer. Something like this:

with gzip.open("values.gz", "rt") as eqtl:
    for line in eqtl:
        row = line.split()
        if row[1] in lookup:
            # assign unique file name here, then
            with open("UniqueFileNameForThisGene", "w") as outfile:
                outfile.write(line)

It would be up to you how you wanted to assign a unique file name for each gene - perhaps using other data in the line?
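One natural naming scheme (my assumption, since the question asks for one file per gene) is to derive the file name from the gene ID in the second column, and to open in append mode so that repeated occurrences of a gene accumulate rather than overwrite each other. The helper name `write_line_for_gene` is hypothetical:

```python
import gzip

def write_line_for_gene(line):
    # Name the output after the gene ID in column 2; gzip "at" append mode
    # creates the file on first use and appends on later calls, so a second
    # match for the same gene does not overwrite the first.
    gene = line.split()[1]
    with gzip.open(gene + ".gz", "at") as outfile:
        outfile.write(line)
```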

paisanco
  • Thank you!... Can I assign the filename of the outfile to be the name of the gene and store it in a gz file? These will be huge files as well... – user971102 Jun 08 '17 at 23:03
  • Sure, there is an existing question with answers on how to write to gzipped files at https://stackoverflow.com/questions/8156707/gzip-a-file-in-python/11524322#11524322 – paisanco Jun 08 '17 at 23:40
  • I am not getting the right output here, it's saving only the last match; do I need to first save the line into another data structure? Also, I can redirect output to a gz file, but my question was how do I name the file based on the key? – user971102 Jun 09 '17 at 00:00
  • If values.gz has multiple entries for the same gene, you want to reopen the same output file for that gene with access append as you iterate through the lines. You might want to have a dictionary of filenames for each gene to look up the file name. – paisanco Jun 09 '17 at 00:03
  • I am confused then...doesn't it make more sense to find all values for a gene first and then move on to another row from the bed file? – user971102 Jun 09 '17 at 00:05
  • In general since the values file is large, it's advantageous to iterate through its lines as few times as possible (once ideally). – paisanco Jun 09 '17 at 00:07
  • Oh I see... Ok thank you! This works in my small example file, but I haven't tried it on my huge gz file with many many entries per gene, before I do could I just check that what I have after your comments is the most efficient (I edited the question)? In particular I am worried about doing this with my large file for two reasons: 1) many opened files, is this a problem? 2) I would prefer by gene because at least I know for sure that all the rows referring to a gene are in the file (and if the script fails I can resubmit only specific genes); what do you think? thanks again for all the help – user971102 Jun 09 '17 at 02:10
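On the "many opened files" worry in the last comment: reopening the per-gene file in append mode for every line is correct but slow, while keeping one handle open per gene for all 15,000 genes could exceed the OS file-descriptor limit (often 1024 by default). A middle-ground sketch, under the assumption that the number of distinct matching genes stays below that limit, is to cache open handles in a dictionary and close them together at the end; the names `split_stream`, `eqtl`, and `wanted` are illustrative:

```python
import gzip

def split_stream(eqtl, wanted):
    # eqtl: an iterable of data lines (header already skipped);
    # wanted: a set of gene IDs to keep.
    # Each gene's file is opened once, reused for later matches,
    # and all handles are closed in the finally block.
    handles = {}
    try:
        for line in eqtl:
            gene = line.split()[1]
            if gene in wanted:
                if gene not in handles:
                    handles[gene] = gzip.open(gene + ".gz", "wt")
                handles[gene].write(line)
    finally:
        for fh in handles.values():
            fh.close()
```

If the number of genes does approach the descriptor limit, falling back to the append-mode reopening shown earlier (or processing the bed genes in batches) keeps the script safe at the cost of speed.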