I have millions of short input files. PyLauncher will run on supercomputers, launching millions of Python scripts in parallel. Each script runs a program on each input file, copies 2 lines from the program's output, and appends those 2 lines to results.txt. The Python script looks like:

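import subprocess
for input_file in directory:  # directory: an iterable of input file paths
    # Run the program, keep output lines 22 and 39, and append them to results.txt
    subprocess.run(f"script_name {input_file} | sed -n '22p; 39p' | tee -a results.txt", shell=True)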

results.txt will have 2*num_input_files lines (millions), like:

Ligand: ./input/ZINC00001677.pdbqt
1       -8.288          0          0
Ligand: ./input/ZINC00001567.pdbqt
1       -10.86          0          0
Ligand: ./input/ZINC00001601.pdbqt
1       -7.721          0          0

I'd like to rearrange this, drop the 1, 0, and 0 from the second line of each pair, and sort so that the most negative number comes first, so it looks like:

-10.86 ZINC00001567.pdbqt
-8.288 ZINC00001677.pdbqt
-7.721 ZINC00001601.pdbqt

I found this StackOverflow question: How do I sort two lines at a time in bash, using the second line as index?

But I can't quite get the commands to work for my file. Speed of execution is very important, so Bash commands or Python could both work, depending on which is faster. Thanks in advance!

darrowboat
  • It's very easy to do but in order to sort the data you'll have to have everything in memory. Is that going to be a constraint? – DarkKnight Jan 19 '23 at 15:50
  • I am not sure about that. This will be run on very fast supercomputers. To get the results file that I quoted above, PyLauncher will run the same script for all million+ files that runs a program on the input file, copies 2 lines from its output, and appends them to a results.txt. – darrowboat Jan 19 '23 at 15:54
  • So you have millions of files and each file contains millions of lines. Is that right? – DarkKnight Jan 19 '23 at 15:58
  • No, sorry. I have millions of short input files. A python script runs a program on each input and copies 2 lines from the output of each. Then appends those 2 lines to results.txt, which will have 2*num_input_files lines. – darrowboat Jan 19 '23 at 16:03
  • Your question now contradicts your comments. Please rewrite the question stating **exactly** what you have and what you need. You might also want to qualify what you mean by a "negative sort". What you've shown appears to be a normal floating point order – DarkKnight Jan 19 '23 at 16:05
  • @Pingu thank you. I have edited my question to be as precise as I can. I also ran your Python script and it works for what I need. Do you think this would be as fast/faster than running Bash commands? – darrowboat Jan 19 '23 at 16:17
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/251269/discussion-between-pingu-and-darrowboat). – DarkKnight Jan 19 '23 at 16:50

2 Answers

In Python I would do something like this:

with open('input.txt', 'r') as f_inp, open('output.txt', 'w') as f_out:
    while True:
        one = f_inp.readline().strip('\n')
        if not one:
            break
        two = f_inp.readline().strip('\n')
        f_out.write(f'{two} - {one}\n')

Then I would leave it to the sort command to do the sort part.
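
A minimal sketch of how the reshaping step could emit lines that sort directly, assuming every record is exactly two lines in the layout shown in the question (a `Ligand: <path>` line followed by a line whose second field is the score). The helper name `reshape_pairs` and the sample data are illustrative only, and blank or malformed lines are not handled.

```python
import os
from typing import Iterable, Iterator


def reshape_pairs(lines: Iterable[str]) -> Iterator[str]:
    """Turn each 'Ligand: <path>' / '<rank> <score> ...' pair into '<score> <name>'."""
    it = iter(lines)
    for ligand_line in it:
        score_line = next(it, None)
        if score_line is None:  # odd number of lines: skip the dangling one
            break
        name = os.path.basename(ligand_line.split()[-1])  # drop './input/'
        score = score_line.split()[1]  # second field is the score
        yield f"{score} {name}"


# Sample taken from the question's results.txt excerpt
sample = [
    "Ligand: ./input/ZINC00001677.pdbqt",
    "1       -8.288          0          0",
    "Ligand: ./input/ZINC00001567.pdbqt",
    "1       -10.86          0          0",
    "Ligand: ./input/ZINC00001601.pdbqt",
    "1       -7.721          0          0",
]
reshaped = list(reshape_pairs(sample))
for line in reshaped:
    print(line)
```

With the score as the leading field, the reshaped file could then be handed to the shell, for example `sort -n`, which compares the leading number and so puts the most negative score first.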

stenci

If you have enough RAM to store the output file contents then you could do this:

from os.path import basename

INPUTFILE = 'verylargefile.txt'
OUTPUTFILE = 'results.txt'

result = []

with open(INPUTFILE) as data:
    while line := data.readline():
        filename = basename(line.split()[-1])
        v = data.readline().split()[1]
        result.append(f'{v} {filename}\n')


with open(OUTPUTFILE, 'w') as data:
    data.writelines(sorted(result, key=lambda x: float(x.split()[0])))
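
If RAM did turn out to be a constraint, one possible alternative is a chunked merge sort: sort fixed-size chunks in memory, spill each to a temporary file, then merge the sorted chunks lazily. This is only a sketch under the assumption that the lines are already in `<score> <name>` form; `external_sort`, `_score`, and `chunk_size` are hypothetical names, not part of the answer above.

```python
import heapq
import os
import tempfile
from itertools import islice


def _score(line: str) -> float:
    """Sort key: the leading numeric field of a '<score> <name>' line."""
    return float(line.split()[0])


def external_sort(lines, out_path, chunk_size=100_000):
    """Sort '<score> <name>' lines by score, holding one chunk in memory at a time."""
    chunk_paths = []
    it = iter(lines)
    try:
        while True:
            # Sort one bounded chunk in memory and spill it to a temp file
            chunk = sorted(islice(it, chunk_size), key=_score)
            if not chunk:
                break
            fd, path = tempfile.mkstemp(suffix=".chunk", text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(line.rstrip("\n") + "\n" for line in chunk)
            chunk_paths.append(path)
        # Merge the sorted chunk files lazily, line by line
        files = [open(p) for p in chunk_paths]
        try:
            with open(out_path, "w") as out:
                out.writelines(heapq.merge(*files, key=_score))
        finally:
            for f in files:
                f.close()
    finally:
        for p in chunk_paths:
            os.remove(p)


demo = [
    "-8.288 ZINC00001677.pdbqt",
    "-10.86 ZINC00001567.pdbqt",
    "-7.721 ZINC00001601.pdbqt",
]
out_file = os.path.join(tempfile.mkdtemp(), "sorted.txt")
external_sort(demo, out_file, chunk_size=2)  # tiny chunk size to force a merge
with open(out_file) as f:
    sorted_lines = f.read().splitlines()
```

For a few million short lines the in-memory version above is likely sufficient; the chunked variant only matters when the data no longer fits in RAM.
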
DarkKnight