
I have this script (below) that segments a large file into smaller files within a directory; once a quota is reached for a directory, a new directory starts with its own quota.

The large file (L.pdbqt) contains multiple molecules, and it gets segmented into files with single molecules. Since this work will be performed using multiple CPUs, the single-molecule files are divided equally across multiple directories, so that each directory holds the collection of files it will compute.

import os
import itertools

def split(filename, direct, limit):
    with open(filename) as infile:
        count = 0
        in_dir_count = 0
        for dircount in itertools.count():
            for line in infile:
                fields = line.split()
                # Start of the next molecule: a "MODEL <n>" line.
                # Guard against blank lines, which would raise IndexError.
                if len(fields) >= 2 and fields[0] == 'MODEL' and fields[1] == str(count + 1):
                    directory = os.path.join(direct, str(dircount + 1))
                    os.makedirs(directory, exist_ok=True)
                    out = os.path.join(directory, '{}.pdbqt'.format(count + 1))
                    new_name = out  # fallback if no "REMARK  Name" line is found
                    with open(out, 'w') as outfile:
                        # Copy lines up to (but not including) the ENDMDL marker.
                        for line in infile:
                            if line.strip() == 'ENDMDL':
                                break
                            fields = line.split()
                            if len(fields) >= 4 and fields[0] == 'REMARK' and fields[1] == 'Name':
                                new_name = os.path.join(directory, '{}.pdbqt'.format(fields[3]))
                            outfile.write(line)
                    os.rename(out, new_name)
                    count += 1
                    in_dir_count += 1
                    if in_dir_count >= limit:
                        in_dir_count = 0
                        print('[+] Finished directory {}'.format(directory))
                        break
            else:
                break  # input exhausted
    print('----------\n[+] Done')

split('L.pdbqt', 'Ligands', 20)

The L.pdbqt file can be downloaded from here: https://www.dropbox.com/s/2dzu0k0zkw0uumn/L.pdbqt?dl=0

In this "example" there are only 120 molecules in the L.pdbqt file, and the script segments them such that each directory has only 20 molecules. But in the "actual" full run, the L.pdbqt file will contain 100,000,000 molecules and each directory will contain 300,000 individual molecule files.

The Problem: When performing the "actual" full run of 100,000,000 molecules, the script fails at "random" points by raising:

Traceback (most recent call last):
  File "segment.py", line 31, in <module>
    split('below.pdbqt', 'Ligands', 300000)
  File "segment.py", line 21, in split
    outfile.write(line)
OSError: [Errno 5] Input/output error

The failure happens at a random point each time, so I cannot replicate it easily, and the breaking points are several hundreds of thousands of molecules into the script (sometimes after 9,000 molecules [3 directories] and sometimes after 21,000 molecules [7 directories], etc.), so it cannot be reproduced with the example file (which completes perfectly).

The Question: How can I make my script more efficient such that it is more robust to breaking, and handles these very large volume input/output operations efficiently without breaking?
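One way to make the script more tolerant of transient I/O failures (a sketch, not part of the original post; `retry_io` is a hypothetical helper name) is to retry the failing operation with exponential backoff instead of letting a single `EIO` kill a multi-day run:

```python
import errno
import time

def retry_io(func, *args, retries=5, delay=1.0, **kwargs):
    """Call func(*args, **kwargs), retrying on OSError EIO
    (Errno 5) with exponential backoff; re-raise on the last
    attempt or on any other error."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except OSError as e:
            if e.errno != errno.EIO or attempt == retries - 1:
                raise
            time.sleep(delay * (2 ** attempt))

# Example: wrap the write that failed in the traceback
# retry_io(outfile.write, line)
```

This only papers over the symptom; if the underlying filesystem or disk is genuinely failing under load, the retries will eventually be exhausted, but it lets the run survive intermittent glitches.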

AC Research
  • Are you running on Windows? If so, it could be related to [Python rapidly creating and removing directories will cause WindowsError Error 5 intermittently](https://stackoverflow.com/questions/32243199/python-rapidly-creating-and-removing-directories-will-cause-windowserror-error) – DarrylG Jun 09 '20 at 09:52
  • actually i am on linux – AC Research Jun 09 '20 at 16:52

0 Answers