
I'm trying to download sequences from a list of URLs by adding a suffix to each one. The for loop works well, i.e. I managed to download the sequences, but as soon as one URL does not exist, an error is raised and the execution stops. I want the script to keep running to the end even when a URL doesn't exist. I would also like to put all the output files in a folder named after the species entered at the "input" prompt. Thank you. Here is part of the script:

species = input("Bacteria species ? : ")
TypeSeq = input("fna ? or faa ? : ")

if data["#Organism/Name"].str.contains(species, case = False).any():

    print(data.loc[data["#Organism/Name"].str.contains(species, case = False)]['Status'].value_counts())  
    FTP_list = data.loc[data["#Organism/Name"].str.contains(species, case = False)]["FTP Path"].values

if  TypeSeq == "faa" : 
        try : 
            for url in FTP_list:
                 
                parts = urllib.parse.urlparse(url)
                parts.path
                posixpath.basename(parts.path)
                suffix = "_protein.faa.gz"
                prefix = posixpath.basename(parts.path) 
                print(prefix+suffix)
                
                path = posixpath.join(parts.path, prefix+suffix)
                ret = parts._replace(path=path) 
                
                sequence=wget.download(urllib.parse.urlunparse(ret))
        except :
            print ("")
asked by Yad.yos (edited by MattDMo)
  • you can try moving the `try` block into the for loop to encapsulate just `urllib.parse.urlparse()` if that is the segment that is throwing the error – Null Salad Jun 14 '21 at 22:54
  • Just a side note, you should almost [never use `except:`](https://stackoverflow.com/a/18982726/3888719) without specifying the exception type. `except:` will catch even system events like SystemExit or KeyboardInterrupt, making your program difficult to stop. Instead, specify `except Exception`, or even better, specify the exception type you want to handle. – Michael Delgado Jun 15 '21 at 05:36
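As the comment above suggests, catching a specific exception type keeps Ctrl-C and system exits working. A minimal standard-library sketch of the idea (the `fetch` helper is hypothetical, not part of the original script):

```python
import urllib.error
import urllib.request


def fetch(url):
    # Catch only the errors a failed download can raise,
    # letting KeyboardInterrupt and SystemExit propagate.
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except urllib.error.URLError as e:
        print(f"Skipping {url}: {e}")
        return None
```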

1 Answer

You have put your loop inside the try block, so whenever a URL is not found, an error is thrown, control leaves the loop, and the except block catches it. This stops the execution of your script. To fix it, put the try/except block inside the loop, so that after catching the error it moves on to the next URL.

The `wget.download` function takes an `out` parameter that specifies either the name of the downloaded file or the directory to download into. You can use it to put all the output files for a species in one folder.
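For instance, the folder can be created up front and passed to every download call. A minimal sketch (`Escherichia_coli` is a hypothetical stand-in for whatever the user typed; the `wget` call is shown as a comment since it needs a live URL):

```python
import os

species = "Escherichia_coli"  # hypothetical example value from input()

# Create the per-species output folder once, before the download loop.
# exist_ok=True makes this safe to re-run.
os.makedirs(species, exist_ok=True)

# Inside the loop, out=species drops each file into that folder:
# sequence = wget.download(file_url, out=species)
```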

Try the below code:

import os
import posixpath
import urllib.parse

import wget

# `data` is the assembly-summary DataFrame loaded earlier in the script.
species = input("Bacteria species ? : ")
TypeSeq = input("fna ? or faa ? : ")

if data["#Organism/Name"].str.contains(species, case = False).any():

    print(data.loc[data["#Organism/Name"].str.contains(species, case = False)]['Status'].value_counts())  
    FTP_list = data.loc[data["#Organism/Name"].str.contains(species, case = False)]["FTP Path"].values

if TypeSeq == "faa":

    # Create the output folder named after the species.
    os.makedirs(species, exist_ok=True)

    for url in FTP_list:
        try:
            parts = urllib.parse.urlparse(url)
            # The last path component of the FTP path is the assembly name;
            # the protein FASTA file is "<assembly>_protein.faa.gz".
            prefix = posixpath.basename(parts.path)
            suffix = "_protein.faa.gz"
            print(prefix + suffix)

            path = posixpath.join(parts.path, prefix + suffix)
            ret = parts._replace(path=path)

            # out=species puts the file in the per-species folder.
            sequence = wget.download(urllib.parse.urlunparse(ret), out=species)
        except Exception as e:
            # A missing URL no longer stops the loop; report and continue.
            print(f"Could not download {url}: {e}")
answered by Ank
  • Thank you very much for your answer, but the code doesn't work. It's a dataframe with a column named "FTP Path" – Yad.yos Jun 15 '21 at 14:32
  • Let me help you. Which part is not working exactly? I am using your code only; I just switched things in your for loop and wrote the data to files, nothing else. Everything else was your code. – Ank Jun 15 '21 at 14:36
  • That's probably your editor. This is a common problem in python when tabs and spaces get mixed in editor. Never encountered this before? You need to convert all tabs to spaces in your code within your editor. Have a look: [inconsistent use of tabs and spaces in indentation](https://stackoverflow.com/questions/5685406/inconsistent-use-of-tabs-and-spaces-in-indentation?rq=1) – Ank Jun 15 '21 at 14:45
  • Thank you very much for that; this problem is solved. I just have to integrate gzip to unzip the output files. The output is several files in .gz format, so I would like to unzip all the files and then put them in a folder named after the species (input). Please, I don't know how I can add gzip (or another tool) to the line "sequence = wget.download(urllib.parse.urlunparse(ret), out=espece)" – Yad.yos Jun 15 '21 at 14:58
  • What's the content inside those gz files? I mean what type of data is in it, is it plain text data? Can you paste one of the links here? To decompress, need to know what type of data is in them. – Ank Jun 15 '21 at 15:08
  • These are fasta files – Yad.yos Jun 15 '21 at 15:18
  • Sorry, I am not aware of how to decompress FASTA files. Better to post a new question on how to unzip FASTA files with Python and someone should help you out. Since your question in this post was about downloading files from URLs (without stopping when URLs don't exist), I suppose my answer works for you in that regard. – Ank Jun 15 '21 at 15:28
  • @Yad.yos Welcome. You can mark the answer as accepted if it solved the problem you stated in your question. Let me know if you have any more questions besides the unzipping part. Sorry couldn't help you there. – Ank Jun 17 '21 at 14:39