
I'm doing a bioinformatics study in which I process some data and write the outputs to a set of folders. The folder/file structure looks like this for two of the folders:

binned/90-20-09-2018/bins/90-20-09-2018.001, 90-20-09-2018.002, 90-20-09-2018.003 and so forth

binned/90-25-04-2018/bins/90-25-04-2018.001, 90-25-04-2018.002, 90-25-04-2018.003 and so forth

I know the number of folders, but the number of files in each folder is unknown and will vary.

Another file called taxonomy (e.g. binned/90-20-09-2018/bins/quality/taxonomy.txt) contains a table of bacterial names for each of the bins (the files named 90-20-09-2018.001, 90-20-09-2018.002 etc.). As you can see, each bin ID has a corresponding taxonomy.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id              # unique markers (of 43)   # multi-copy   Taxonomy                                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  90-20-09-2018.001              25                   15        k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus          
  90-20-09-2018.003              24                   0         k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus          
  90-20-09-2018.002              15                   0         k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae_2;g__Lactobacillus_2      
  90-20-09-2018.005              14                   11        k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae                           
  90-20-09-2018.004              12                   0         k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinomycetaceae;g__Mobiluncus  
----------------------------------------------------------------------------------------------------------------------------------------------------------------------

What I need is to rename each of the bin files (90-20-09-2018.001, 90-20-09-2018.002 etc.) to its corresponding taxonomy (genus) name. The genus name is the name that comes after "g__", so for bin 001 it would be "Lactobacillus".

So the final result would look like this (for the first folder).

binned/90-20-09-2018/bins/Lactobacillus, Lactobacillus_2, Streptococcus and so forth

I imagine this being done with Python (the only programming language I'm familiar with). Feel free to ask questions if I've been unclear.

Thanks!

  • Can you show us what you have tried so far? – Nils Oct 30 '19 at 10:57
  • I must be honest, I have no clue how to approach it. But I failed to mention that I would prefer it to be in Python. I'm keen on taking advice as to how to begin. I have only been working with Python for data manipulation of text files. – mortalknight55hotmailcom Oct 30 '19 at 11:00
  • You can change a folder name with os.rename() and you can split a string via ''.split('g__'). https://stackoverflow.com/questions/8735312/how-to-change-folder-names-in-python https://www.tutorialspoint.com/python/string_split.htm That should help you to start ;) – Nils Oct 30 '19 at 11:04
  • Also, this will help: https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory Just try around a bit and if you then run into problems. Just add a comment and I will help you ;) – Nils Oct 30 '19 at 11:06
  • Thanks! And how would I decide what file should be named what? I guess I have to match the current file name with the "bin ID" in the taxonomy-file to let python know which file I'm renaming, right? – mortalknight55hotmailcom Oct 30 '19 at 11:07
  • Thanks, I will take a look at it, and return to you if in doubt! – mortalknight55hotmailcom Oct 30 '19 at 11:07
  • Yes, you can just iterate through your list of folders and match it with your list/search in your list for it. – Nils Oct 30 '19 at 11:08
  • If you are able to come up with a solution yourself, please post it as an answer, so that anyone else who stumbles upon this post finds a solution to the problem. – Nils Oct 30 '19 at 11:10
  • could you include like a google drive or dropbox link to some example files? When building something like this it helps to be working with the actual file structures you're importing – Michael Green Oct 30 '19 at 12:18
  • any update here? wanna try and help out, just need the appropriate data structure. – Michael Green Oct 31 '19 at 15:26
  • Hi Michael, I still struggle with it and would definitely appreciate some help. Below is a link to a dropbox. The folder structure and files in them are just how I have them. The only difference is, that the "bin" files are empty (as their contents are not relevant, and are also quite large). The taxonomy file should just be opened with some kind of text-editor. https://www.dropbox.com/sh/ho3ux5wplv7yk4u/AAAuqLOFKtxXr3KTwGqca778a?dl=0 – mortalknight55hotmailcom Nov 01 '19 at 07:11
  • I made a new thread, with a more specific problem (I figured, I also need to change the headers of the fasta-files). I also updated the files and folder LINK. https://stackoverflow.com/questions/58673129/replacing-headers-in-fasta-file-and-replacing-filename-with-string-from-separat – mortalknight55hotmailcom Nov 02 '19 at 16:44

1 Answer


So here's what I got for ya:

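import pandas as pd
import glob
from os.path import split, splitext
from os import rename

# Sample folder to process (Windows path; adjust to your machine).
directory = r'D:\Research and Teaching\ZZ General\Python\binned\90-20-09-2018'

# Glob patterns, appended to the sample folder path.
fastas = r'\bins\*.fasta'
taxonomy = r'\quality\*.txt'

# Bin ID (file name without '.fasta') -> full path of the FASTA file.
fasta_dir = {splitext(split(fasta_file)[1])[0]: fasta_file
             for fasta_file in glob.glob(directory+fastas)}

# Each row of the table is expected to come back as one string in column 0.
tax = pd.read_table(glob.glob(directory+taxonomy)[0]).to_numpy()

# Row number -> list of the non-empty, space-separated tokens in that row.
data = {count: [item for item in tax[count][0].split(' ') if item != '']
        for count, line in enumerate(tax)}

# Bin ID -> last rank of the taxonomy string (e.g. 'g__Lactobacillus').
# Rows with a single token (the dashed separators) are skipped.
files = {data[item][0]: data[item][-1].split(';')[-1]
         for item in data if data[item][0] != data[item][-1]}

# Rename every FASTA file in place to '<last rank>.fasta'.
for key in fasta_dir:
    rename(fasta_dir[key], split(fasta_dir[key])[0]+'\\'+files[key]+r'.fasta')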

Basically, we build two dictionaries: one mapping each bin ID to the path of its FASTA file, and one mapping each bin ID to the name taken from the taxonomy file (actually the most specific rank available, because, as can be seen, sometimes your resolution only goes down to the family). Those dictionaries are then tied to the 'os.rename' command, which does the name swap for us.
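Note that split(';')[-1] in the code above leaves the rank prefix in place, so a file ends up as g__Lactobacillus.fasta. If only the bare name is wanted, the lookup can also be built without pandas. The following is a minimal sketch rather than a drop-in replacement: parse_taxonomy and deepest_name are hypothetical helper names, and it assumes that every data row starts with the bin ID and ends with the taxonomy string, as in the table from the question.

```python
def deepest_name(taxonomy):
    """Return the most specific rank name without its 'g__' / 'f__' prefix."""
    last_rank = taxonomy.split(';')[-1]     # e.g. 'g__Lactobacillus'
    return last_rank.split('__', 1)[-1]     # e.g. 'Lactobacillus'


def parse_taxonomy(lines):
    """Map bin ID -> most specific taxon name from the lines of the table."""
    names = {}
    for line in lines:
        tokens = line.split()
        # Skip blank lines, the dashed separators and the header row:
        # only data rows end in a 'k__...;p__...' taxonomy string.
        if len(tokens) < 2 or '__' not in tokens[-1]:
            continue
        names[tokens[0]] = deepest_name(tokens[-1])
    return names


rows = [
    "  90-20-09-2018.001   25   15   k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus",
    "  90-20-09-2018.005   14   11   k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae",
]
print(parse_taxonomy(rows))
# {'90-20-09-2018.001': 'Lactobacillus', '90-20-09-2018.005': 'Lachnospiraceae'}
```

Bin 005 has no genus, so it falls back to the family name, which matches the "most specific rank" behaviour described above.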

This should work for any sample folder as long as it has the same structure, i.e. a bins folder holding the FASTA files and a quality folder holding the taxonomy file. Just point the directory at the appropriate sample folder. Also, the import depends on the taxonomy file having an explicit .txt extension, so if yours doesn't have one you'll need to rename it.
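To process every sample folder in one go instead of editing the directory each time, the same idea can be wrapped in a loop. This is only a rough sketch under assumptions: rename_bins and read_names are hypothetical names, the layout is assumed to be binned/<sample>/bins/*.fasta with the table in binned/<sample>/quality/*.txt, and the dry_run flag is an addition that lets you inspect the planned renames before touching any file.

```python
from pathlib import Path


def read_names(taxonomy_file):
    """Map bin ID -> most specific taxon name (prefix such as 'g__' removed)."""
    names = {}
    for line in Path(taxonomy_file).read_text().splitlines():
        tokens = line.split()
        if len(tokens) >= 2 and '__' in tokens[-1]:    # data rows only
            names[tokens[0]] = tokens[-1].split(';')[-1].split('__', 1)[-1]
    return names


def rename_bins(binned_root, dry_run=True):
    """Plan (and optionally perform) the renames for every sample folder."""
    planned = []
    taken = set()                           # targets already claimed in this run
    for sample in sorted(Path(binned_root).iterdir()):
        if not sample.is_dir():
            continue
        tables = sorted((sample / 'quality').glob('*.txt'))
        if not tables:
            continue                        # no taxonomy table: skip this sample
        names = read_names(tables[0])
        for fasta in sorted((sample / 'bins').glob('*.fasta')):
            taxon = names.get(fasta.stem)   # stem is the bin ID
            if taxon is None:
                continue                    # bin not listed: leave it alone
            target = fasta.with_name(taxon + '.fasta')
            if target in taken or target.exists():
                continue                    # never overwrite another file
            taken.add(target)
            planned.append((fasta, target))
            if not dry_run:
                fasta.rename(target)
    return planned
```

Calling rename_bins('binned') returns the list of (old path, new path) pairs without changing anything; rename_bins('binned', dry_run=False) performs them. Two bins in the same folder that resolve to the same name would collide, so the sketch keeps the first and leaves the later ones untouched.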

Here's what I got as the result:

[image: screenshot of the result]

Michael Green