The glob.glob function to extract data from files

Question

I am trying to run the script below. The intention of the script is to open different fasta files one after the other and extract the geneID. The script works well if I don't use the glob.glob function. With glob.glob, I get this error message: TypeError: coercing to Unicode: need string or buffer, list found

files='/home/pathtofiles/files'
#print files
#sys.exit()
for file in files:
    fastas=sorted(glob.glob(files + '/*.fasta'))
    #print fastas[0]
    output_handle=(open(fastas, 'r+'))
    genes_files=list(SeqIO.parse(output_handle, 'fasta'))
    geneID=genes_files[0].id
    print geneID

I am running out of ideas on how to make the script open one file after another and give me the required information.

Is `fastas` a list? i.e. in here `output_handle=(open(fastas, 'r+'))` — doctorlove, Sep 27 '17 at 15:47

BioGeek · Answer 1 · 2017-09-28T09:36:33.017

I see what you are trying to do, but let me first explain why your current approach is not working.

You have a path to a directory with fasta files and you want to loop over the files in that directory. But observe what happens if we do:

>>> files='/home/pathtofiles/files'
>>> for file in files:
...     print file
...
/
h
o
m
e
/
p
a
t
h
t
o
f
i
l
e
s
/
f
i
l
e
s

Not the list of filenames you expected! files is a string and when you apply a for loop on a string you simply iterate over the characters in that string.

Also, as doctorlove correctly observed, in your code fastas is a list and open expects a path to a file as first argument. That's why you get the TypeError: ... need string, ... list found.
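To make the difference between the list and its elements concrete, here is a minimal sketch. The temporary directory and the two tiny FASTA files are made up for illustration and are not part of your data. It uses print(...) with a single argument so that it should run on both Python 2 and Python 3; the exact wording of the TypeError differs between the two versions.

```python
import glob
import os
import tempfile

# Hypothetical setup: a throwaway directory with two small FASTA files.
tmp_dir = tempfile.mkdtemp()
for name in ('b.fasta', 'a.fasta'):
    with open(os.path.join(tmp_dir, name), 'w') as handle:
        handle.write('>gene_' + name[0] + '\nACGT\n')

# glob.glob returns a list of matching paths, even if only one file matches.
fastas = sorted(glob.glob(os.path.join(tmp_dir, '*.fasta')))
print(type(fastas))

# Passing the whole list to open() raises a TypeError ...
try:
    open(fastas)
except TypeError as error:
    print('TypeError: ' + str(error))

# ... so open each path from the list on its own instead.
first_lines = []
for fasta_path in fastas:
    with open(fasta_path) as handle:
        first_lines.append(handle.readline().strip())
print(first_lines)
```

The loop variable fasta_path is a single string on every iteration, which is what open expects.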

As an aside (this is more of a problem on Windows than on Linux or Mac): it is good practice to use raw string literals (prefix the string with an r) when working with pathnames, to prevent the unwanted expansion of backslash escape sequences like \n and \t into newline and tab.

>>> path = 'C:\Users\norah\temp'
>>> print path
C:\Users
orah    emp
>>> path = r'C:\Users\norah\temp'
>>> print path
C:\Users\norah\temp

Another good practice is to use os.path.join() when combining pathnames and filenames. This prevents subtle bugs where your script works on your machine but gives an error on the machine of a colleague who has a different operating system.
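A small sketch of what os.path.join does for you, using the directory from your question. The variable names are only for illustration.

```python
import os

directory = '/home/pathtofiles/files'

# os.path.join inserts the separator of the current operating system.
pattern = os.path.join(directory, '*.fasta')
print(pattern)

# It does not double the separator when the directory already ends with one.
with_slash = os.path.join(directory + '/', '*.fasta')
print(with_slash)

# Plain concatenation silently builds a wrong pattern if the separator is forgotten.
broken = directory + '*.fasta'
print(broken)
```

The last pattern would match nothing in the directory, and glob.glob would simply return an empty list without raising an error.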

I would also recommend using the with statement when opening files. This ensures that the file handle gets properly closed when you're done with it.
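The following sketch shows the closing behaviour of the with statement. The scratch file is hypothetical and only exists so the example can run anywhere.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
fd, scratch_path = tempfile.mkstemp(suffix='.fasta')
os.close(fd)

with open(scratch_path, 'w') as handle:
    handle.write('>gene1\nACGT\n')
    closed_inside = handle.closed
closed_outside = handle.closed
print(closed_inside)
print(closed_outside)

# The handle is also closed when the block is left through an exception.
try:
    with open(scratch_path) as handle:
        raise ValueError('simulated parsing problem')
except ValueError:
    pass
closed_after_error = handle.closed
print(closed_after_error)

os.remove(scratch_path)
```

With a bare open(...) call you would need a try/finally block to get the same guarantee.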

As a final remark, file is the name of a built-in in Python 2 (the file type), and it is bad practice to use a variable with the same name as a built-in because that can cause bugs or confusion later on.
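Here is a sketch of the kind of bug that shadowing can cause. Since file is a built-in only in Python 2, the example shadows list instead, so that it behaves the same on Python 2 and Python 3. Both function names are made up for illustration.

```python
def count_characters_shadowed(text):
    # The local name 'list' hides the built-in list type inside this function.
    list = [text]
    try:
        return len(list(text))
    except TypeError as error:
        return 'TypeError: ' + str(error)


def count_characters(text):
    # With a different variable name the built-in list is still reachable.
    characters = list(text)
    return len(characters)


print(count_characters_shadowed('ACGT'))
print(count_characters('ACGT'))
```

The first function fails because list now refers to an ordinary list object, which cannot be called.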

Combining all of the above, I would rewrite your code like this:

import os
import glob
from Bio import SeqIO

path = r'/home/pathtofiles/files'
pattern = os.path.join(path, '*.fasta')
for fasta_path in sorted(glob.glob(pattern)):
    print fasta_path
    with open(fasta_path, 'r+') as output_handle:
        genes_records = SeqIO.parse(output_handle, 'fasta')
        for gene_record in genes_records:
            print gene_record.id

Ana · Accepted Answer · 2017-09-28T13:17:10.797

This is the way I solved the problem, and this script works.

import os,sys
import glob
from Bio import SeqIO

def extracting_information_gene_id():
    #to extract geneID information and add the reference gene to each different file

    files=sorted(glob.glob('/home/path_to_files/files/*.fasta'))
    #print file
    #sys.exit()
    for file in files:
        #print file
        output_handle=open(file, 'r+')
        ref_genes=list(SeqIO.parse(output_handle, 'fasta'))
        geneID=ref_genes[0].id
        #print geneID
        #sys.exit()

        #to extract the geneID as a reference record from the genes_files
        query_genes=(SeqIO.index('/home/path_to_file/file.fa', 'fasta'))
        #print query_genes[geneID].format('fasta') #check point
        #sys.exit()
        ref_gene=query_genes[geneID].format('fasta')
        #print ref_gene #check point
        #sys.exit()
        output_handle.write(str(ref_gene))
        output_handle.close()
        query_genes.close()

extracting_information_gene_id()
print 'Reference gene sequence have been added'

edited Sep 28 '17 at 13:17

answered Sep 28 '17 at 11:43

What is the use of `.format('fasta')` in the line `ref_gene=query_genes[geneID].format('fasta')`? – BioGeek Sep 28 '17 at 12:29

Please fix your indentation, because the script as posted will not work. – BioGeek Sep 28 '17 at 12:33

@BioGeek. [link](http://biopython.org/DIST/docs/api/Bio.SeqIO-pysrc.html). This is a very elegant way to format my extracted sequence in fasta format, instead of going through `'>' + gene + '\n' + seq + '\n'`. It is one of Biopython's features. – Ana Sep 28 '17 at 13:29
@BioGeek. Also it prints it in a nice fasta format. – Ana Sep 28 '17 at 14:51
Cool! I didn't know you could do that. I learned something new today, thanks! :-) – BioGeek Sep 29 '17 at 08:00
@BioGeek. We always learn new things; that's the good thing about science. I don't know if you can always use it, but I put a link in my previous comment so you can have a look and see its uses. – Ana Sep 29 '17 at 09:47
