
For every input file processed (see code below), I am trying to use `os.path.basename` to derive the name of a new output file. I know I am missing something obvious.

import os
import glob
import gzip

dbpath = '/home/university/Desktop/test'

for infile in glob.glob( os.path.join(dbpath, 'G[D|E]/????/*.gz') ):
    print("current file is: " + infile)

    # --- lines in question ---
    outfile=os.path.basename('/home/university/Desktop/test/G[D|E]/????/??????.xaa.fastq.gz').rsplit('.xaa.fastq.gz')[0]

    file=open(outfile, 'w+')
    # -------------------------

    gzsuppl = Chem.ForwardSDMolSupplier(gzip.open(infile))
    for m in gzsuppl:
        if m is None: continue
...etc

file.close()
print(count)

It is not clear to me how to capture element [0] of the split (i.e. everything upstream of .xaa.fastq.gz) and use it as the basename for the new output file. Unfortunately, the script simply names the new output file "??????" rather than the actual sequence of 6 letters. Thanks for any help given.

jnorth
  • Show a real example of a file path from which you would like to extract the basename using `os.path.basename`, and what you'd like the name of `outfile` to be if it were created correctly. Sorry, your code is somewhat worthless in its current state...it's not even indented properly. – martineau Nov 25 '17 at 01:08
  • Apologies, a real path could be /home/university/Desktop/test/GD /AAML/DEAAML.xaa.fastq.gz The output filename would be DEAAML.fastq.gz. The indentation is correct in my file but I had trouble getting it correct here. – jnorth Nov 25 '17 at 01:10
  • If `filepath = '/home/university/Desktop/test/GD /AAML/DEAAML.xaa.fastq.gz'` then `os.path.dirname(filepath)` would return `/home/university/Desktop/test/GD /AAML`. Is that what you want? You still haven't clearly described what the resulting `outfile`'s name should be. – martineau Nov 25 '17 at 01:16
  • Again apologies, I just updated my last comment to include the example output file. – jnorth Nov 25 '17 at 01:17
  • There are hundreds of folders with 2 characters and within each of those folders would be several dozen more subfolders (all 4 characters). The names of the files are 6 letters.xaa.fastq.gz. The script is meant to iterate over all of those folders, performing the function to each input file and outputting to its respective folder. – jnorth Nov 25 '17 at 01:20
  • Well, `os.path.basename(filepath)` returns `DEAAML.xaa.fastq.gz`, which is the same as the filename of the original. Unless you want to overwrite the original file, to put `outfile` in the same folder you will need to create a different name for it, or put it in a different folder where the name won't conflict with an existing file's name. – martineau Nov 25 '17 at 01:28
  • DEAAML.xaa.fastq.gz is one of several thousand input files throughout that directory structure. `os.path.basename(filepath)` returns "??????" when I use `'/home/university/Desktop/test/G[D|E] /????/??????.xaa.fastq.gz'` – jnorth Nov 25 '17 at 01:36
  • I guess the question should have been how to search through multiple directories with multiple files, grab the "basename" and only use that as the output filename after performing a basic function. This is fairly straightforward as a bash script but I am forced to use python in this case. – jnorth Nov 25 '17 at 01:44
  • OK, I think I understand now...working on it. – martineau Nov 25 '17 at 02:03
  • Thanks @martineau, my apologies again for it being unclear. – jnorth Nov 25 '17 at 02:08
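
The "??????" result discussed in these comments follows from `os.path.basename` being a pure string operation: it never consults the filesystem, so wildcard characters pass straight through. Below is a minimal sketch of that behaviour. The real path is modelled on the example given in the comments, and `posixpath` (the POSIX flavour of `os.path`) is used only so the result is the same on any operating system.

```python
import posixpath

pattern = '/home/university/Desktop/test/G[D|E]/????/??????.xaa.fastq.gz'
real_path = '/home/university/Desktop/test/GD/AAML/DEAAML.xaa.fastq.gz'

# basename only cuts the string at the last '/'; it does not expand wildcards.
pattern_stem = posixpath.basename(pattern).rsplit('.xaa.fastq.gz', 1)[0]
real_stem = posixpath.basename(real_path).rsplit('.xaa.fastq.gz', 1)[0]

print(pattern_stem)  # ??????
print(real_stem)     # DEAAML
```

In other words, the name has to be derived from the path that `glob.glob` actually returned (`infile` in the question), not from the pattern that was handed to it.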

2 Answers


This seems like it will get everything upstream of the .xaa.fastq.gz in the paths returned from glob() in your sample code:

import os

filepath = '/home/university/Desktop/test/GD /AAML/DEAAML.xaa.fastq.gz'
filepath = os.path.normpath(filepath)  # Changes path separators for Windows.

# This section was adapted from answer https://stackoverflow.com/a/3167684/355230
folders = []
while 1:
    filepath, folder = os.path.split(filepath)
    if folder:
        folders.append(folder)
    else:
        if filepath:
            folders.append(filepath)
        break
folders.reverse()

if len(folders) > 1:
    # The last element of folders should contain the original filename.
    filename_prefix = os.path.basename(folders[-1]).split('.')[0]
    outfile = os.path.join(*(folders[:-1] + [filename_prefix + '.rest_of_filename']))
    print(outfile)  # -> \home\university\Desktop\test\GD \AAML\DEAAML.rest_of_filename

Of course, what ends up in outfile isn't the final path plus filename, since I don't know what the remainder of the filename will be, so I just put in a placeholder (the '.rest_of_filename').
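
If only the last path component needs to change, the same result can probably be reached without taking the whole path apart. The following is a minimal sketch, not a replacement for the code above: `swap_suffix` is a hypothetical helper name, the placeholder suffix is the same one used in this answer, and `posixpath` stands in for `os.path` so that the separators stay as forward slashes regardless of platform.

```python
import posixpath  # on Linux/macOS this is what os.path resolves to


def swap_suffix(filepath, new_suffix):
    """Hypothetical helper: keep the folder, replace everything after the
    first '.' in the filename with new_suffix."""
    folder, filename = posixpath.split(filepath)
    prefix = filename.split('.', 1)[0]
    return posixpath.join(folder, prefix + new_suffix)


outfile = swap_suffix(
    '/home/university/Desktop/test/GD /AAML/DEAAML.xaa.fastq.gz',
    '.rest_of_filename',
)
print(outfile)  # /home/university/Desktop/test/GD /AAML/DEAAML.rest_of_filename
```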

martineau
  • Wow @martineau, thank you! Give a few minutes to digest and incorporate :) – jnorth Nov 25 '17 at 02:34
  • Note that I'm using Windows, which is why there are backslash path separators in the printed result instead of forward slashes. – martineau Nov 25 '17 at 02:41
  • Again, thank you very much @martineau for all your help - I have learned a tremendous amount. Your code is extremely useful and here is what finally worked on my end with some hints from your code. `current file is: /home/university/test/GD/AAML/DEAAML.xaa.fastq.gz current file is: /home/university/test/GE/AARL/DEAARL.xaa.fastq.gz current file is: /home/university/test/GE/AARN/DEAARN.xab.fastq.gz etc... outfile=os.path.basename(infile).rsplit('.')[0] outfile2='/home/university/test/%s.sdf' % outfile file=open(outfile2, 'w+')` – jnorth Nov 25 '17 at 03:09
  • that first bit is an example of some of the file structure – jnorth Nov 25 '17 at 03:12

I'm not familiar with the kind of input data you're working with, but here's what I can tell you:

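  1. The "something obvious" you're missing is that outfile has no connection to infile. Your outfile line produces ?????? rather than the actual filename because that's the literal string you pass in: os.path.basename only manipulates text and never expands wildcards. It's glob.glob that turns the pattern into a list of matching paths, and those arrive one at a time in infile.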

    Here's how I'd write that aspect of the outfile line:

    outfile = infile.rsplit('.xaa.fastq.gz', 1)[0]
    

    (The , 1 ensures that it'll never split more than once, no matter how crazy a filename gets. It's just a good habit to get into when using split or rsplit like this.)

  2. You're setting yourself up for a bug, because the glob pattern can match *.gz files which don't end in .xaa.fastq.gz, which would mean that a random .gz file which happens to wind up in the folder listing would cause outfile to have the same path as infile and you'd end up writing to the input file.

    There are three solutions to this problem which apply to your use case:

    1. Use *.xaa.fastq.gz instead of *.gz in your glob. I don't recommend this because it's easy for a typo to sneak in and make them different again, which would silently reintroduce the bug.

    2. Write your output to a different folder than you took your input from.

      outfile = os.path.join(outpath, os.path.relpath(infile, dbpath))
      
      outparent = os.path.dirname(outfile)
      if not os.path.exists(outparent):
          os.makedirs(outparent)
      
    3. Add an assert outfile != infile line so the program will die with a meaningful error message in the "this should never actually happen" case, rather than silently doing incorrect things.

  3. The indentation of what you posted could be wrong, but it looks like you're opening a bunch of files, then only closing the last one. My advice is to use this instead, so it's impossible to get that wrong:

    with open(outfile, 'w+') as file:
        # put things which use `file` here
    
  4. The name file is already taken by a built-in in Python 2, and the variable names you chose are unhelpful. I'd rename infile to inpath, outfile to outpath, and file to outfile. That way, you can tell whether each one is a path (i.e. a string) or a Python file object just from the variable name, and there's no risk of accessing file before you (re)define it and getting a very confusing error message.
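
Put together, the four points above could look roughly like the sketch below. It is an illustration under stated assumptions rather than a drop-in script: `convert_all`, `outroot`, and the `.out` extension are hypothetical names, the real per-file processing is replaced by a plain copy of the decompressed text, and the explicit `endswith` check is an extra guard that is not one of the three options listed in point 2. The demo builds a throwaway directory tree so that it can run anywhere (Python 3.2+ for `exist_ok`).

```python
import glob
import gzip
import os
import tempfile

SUFFIX = '.xaa.fastq.gz'


def convert_all(dbpath, outroot):
    """Hypothetical driver: write one output file per matching input,
    mirroring the input folder layout under outroot."""
    written = []
    for inpath in glob.glob(os.path.join(dbpath, 'G[D|E]/????/*.gz')):
        if not inpath.endswith(SUFFIX):
            continue  # extra guard: ignore stray .gz files
        # Point 1: derive the output name from the path glob actually returned.
        relstem = os.path.relpath(inpath, dbpath).rsplit(SUFFIX, 1)[0]
        # Point 2: write below a different root and refuse to clobber the input.
        outpath = os.path.join(outroot, relstem + '.out')
        assert outpath != inpath
        os.makedirs(os.path.dirname(outpath), exist_ok=True)
        # Points 3 and 4: 'with' closes every file; names say path vs. file object.
        with gzip.open(inpath, 'rt') as infile, open(outpath, 'w') as outfile:
            outfile.write(infile.read())  # placeholder for the real processing
        written.append(outpath)
    return sorted(written)


# Demo on a temporary tree with one valid input and one stray .gz file.
with tempfile.TemporaryDirectory() as tmp:
    dbpath = os.path.join(tmp, 'db')
    outroot = os.path.join(tmp, 'out')
    sample_dir = os.path.join(dbpath, 'GD', 'AAML')
    os.makedirs(sample_dir)
    with gzip.open(os.path.join(sample_dir, 'DEAAML.xaa.fastq.gz'), 'wt') as f:
        f.write('example\n')
    with gzip.open(os.path.join(sample_dir, 'stray.gz'), 'wt') as f:
        f.write('ignored\n')
    results = [os.path.relpath(p, outroot) for p in convert_all(dbpath, outroot)]
    print(results)
```

Only the file ending in `.xaa.fastq.gz` produces an output; the stray archive is skipped rather than being overwritten.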

ssokolow
  • very useful feedback thank you. I will incorporate your suggestions most definitely. Good explanation for learning. – jnorth Nov 25 '17 at 03:27