I am having trouble finding a way to use the output of the Linux sort command as input to my Python script.

For example, I would like to iterate through the result of sort -mk1 <(cat file1.txt) <(cat file2.txt)

Normally I would use Popen and iterate through it using next and stdout.readline(), something like:

import os
import subprocess

class Reader():
    def __init__(self):
        self.proc = subprocess.Popen(['sort -mk1', '<(', 'cat file1.txt', ')', '<(', 'cat file2.txt', ')'], stdout=subprocess.PIPE)

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            line = self.proc.stdout.readline()
            if not line:
                raise StopIteration
            return line


p = Reader()
for line in p:
    # only print certain lines based on some filter 

With the above, I would get an error: No such file or directory: 'sort -mk1'

After doing some research, I guess I can't use Popen and have to use os.execl to run /bin/bash.

So now I am trying the following:

import os
import subprocess

class Reader():
    def __init__(self):
        self.proc = os.execl('/bin/bash', '/bin/bash', '-c', 'set -o pipefail; sort -mk1 <(cat file1.txt) <(cat file2.txt)')

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            line = self.proc.stdout.readline()
            if not line:
                raise StopIteration
            return line


p = Reader()
for line in p:
    # only print certain lines based on some filter 

The problem with this is that it prints all the lines right away instead of letting me iterate over them. I guess one solution is to redirect the results to a file and then iterate through that file in Python. But I don't really want to save the output to a file and then filter it; that seems unnecessary. Yes, I could use other Linux commands such as awk, but I would like to use Python for further processing.

So my questions are:

  1. Is there a way to make the first solution, with Popen, work?
  2. How can I iterate through the output of sort using the second solution?
user1179317
  • Process substitution ( `<( command )` ) is something provided by bash (it runs a command, creates a FIFO, and substitutes the name of the FIFO into the command line). If you feed these as arguments to `sort`, it won't be able to do what you want (quite likely `sort` is going to treat `<(` and `)` as filenames). Why can't you simply do `sort -mk1 filename1.txt filename2.txt` ? – Adrian Shum Oct 05 '20 at 04:34
  • For your second case, using `os.exec*` is going to replace the whole process, so it will not continue to the next statements in your Python script, hence it does not make sense to handle the output. Haven't tried, but why can't you use `Popen` to spawn a process running `bash` as in your second example? – Adrian Shum Oct 05 '20 at 04:42
  • I guess I am not sure how to use Popen to spawn a process running bash – user1179317 Oct 05 '20 at 04:44

3 Answers

1

If you want to use shell features, you have to use shell=True. If you want to use Bash features, you have to make sure the shell you run is Bash.

        self.proc = subprocess.Popen(
            'sort -mk1 <(cat file1.txt) <(cat file2.txt)',
            stdout=subprocess.PIPE,
            shell=True,
            executable='/bin/bash')

Notice how with shell=True the first argument to Popen and friends is a single string (and vice versa; if you don't have shell=True you have to parse the command line into tokens yourself).

Of course, the cats are useless but if you replace them with something which the shell performs easily and elegantly and which you cannot easily replace with native Python code, this is probably the way to go.
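Combined with the Reader class from the question, this could be sketched as follows. The `command` parameter is an addition for illustration, and the sketch assumes `/bin/bash` and `sort` are available on the system.

```python
import subprocess


class Reader:
    """Iterate lazily over the output lines of a Bash command."""

    def __init__(self, command):
        # shell=True plus executable='/bin/bash' makes Bash (not /bin/sh)
        # interpret the string, so <(...) process substitution works.
        self.proc = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            shell=True,
            executable='/bin/bash',
            text=True,  # decode the output to str instead of bytes
        )

    def __iter__(self):
        return self

    def __next__(self):
        line = self.proc.stdout.readline()
        if not line:
            # End of output: reap the child process before stopping.
            self.proc.wait()
            raise StopIteration
        return line
```

With this in place, `for line in Reader('sort -mk1 <(cat file1.txt) <(cat file2.txt)'):` should yield one line at a time as sort produces it, so the filtering can stay in Python.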

In brief, <(command) is a Bash process substitution; the shell will run command in a subprocess, and replace the argument with the device name of the open file handle where the process generates its output. So sort will see something like

sort -mk1 /dev/fd/63 /dev/fd/64

where /dev/fd/63 is a pipe where the first command's output is available, and /dev/fd/64 is the read end of the other command's standard output.
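To see what Bash actually substitutes, one option is to let `echo` print its arguments instead of running sort. This is only a small sketch; the helper name `show_substitution` is made up for illustration, and the exact paths depend on the system.

```python
import subprocess


def show_substitution():
    """Return the arguments Bash passes in place of two <(...) expressions."""
    result = subprocess.run(
        'echo <(true) <(true)',
        shell=True,
        executable='/bin/bash',  # process substitution is a Bash feature
        capture_output=True,
        text=True,
        check=True,
    )
    # echo prints the substituted names separated by spaces.
    return result.stdout.split()


print(show_substitution())
```

On a typical Linux system this prints two paths under /dev/fd, which is what sort receives as its file name arguments.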

tripleee
  • Okay, this works. Thanks a lot. Quick question though: I keep seeing advice to avoid shell=True. Why is that? – user1179317 Oct 05 '20 at 13:06
  • See https://stackoverflow.com/questions/3172470/actual-meaning-of-shell-true-in-subprocess – tripleee Oct 05 '20 at 13:27
0

There are quite a few problems in your scripts.

First, your Popen call won't work for several reasons:

  1. The first list element is supposed to be the command to run, but you passed sort -mk1, and there is no executable with that name. You should pass sort as the command and -mk1 as a separate argument.
  2. Process substitution, <( command ), is handled by the shell: it runs the command, creates a FIFO, and substitutes the name of the FIFO into the command line. Passing these tokens directly to sort is not going to work; sort will probably just treat <( as a filename.

Your second approach, using os.exec*, won't work either, because os.exec* replaces your current process. Hence it will never continue to the next statement in your Python script.
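The effect can be demonstrated safely by doing the exec in a throwaway child interpreter. This is a sketch only: the `CHILD` script and `run_child` helper are made up for illustration, and `/bin/sh` is assumed to exist.

```python
import subprocess
import sys

# Script run in a separate Python process. The print after os.execl is
# never reached, because exec replaces the interpreter with /bin/sh.
CHILD = (
    "import os\n"
    "os.execl('/bin/sh', 'sh', '-c', 'echo replaced')\n"
    "print('never printed')\n"
)


def run_child():
    """Run CHILD in a new interpreter and return what it wrote to stdout."""
    result = subprocess.run(
        [sys.executable, '-c', CHILD],
        capture_output=True,
        text=True,
    )
    return result.stdout


print(repr(run_child()))
```

The exec'd program also inherits the caller's standard output, which is why the second attempt in the question printed every line straight to the terminal.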

In your case, there seems to be no reason to use process substitution. Why can't you simply do something like subprocess.Popen(['sort', '-mk1', 'filename1', 'filename2']) ?
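A minimal sketch of that shell-free variant, which still reads the output one line at a time, might look like this. The generator name `merge_sorted` is made up, and the input files are assumed to be sorted already, as `sort -m` requires.

```python
import subprocess


def merge_sorted(*paths):
    """Yield lines from `sort -m -k1` over already-sorted files, lazily."""
    # No shell involved: each list element reaches sort as one argument.
    # The "--" keeps file names starting with "-" from being read as options.
    with subprocess.Popen(
        ['sort', '-m', '-k1', '--', *paths],
        stdout=subprocess.PIPE,
        text=True,
    ) as proc:
        # Iterating over the pipe yields each line as soon as it is available.
        yield from proc.stdout
```

Leaving the `with` block closes the pipe and waits for sort to exit, so no separate cleanup is needed.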

Adrian Shum
  • Let's just say I have to use process substitution, because it's not really just reading a file. I just did that here to simplify the question. In reality, it's taking a zip file, unzipping it, and doing other processing. – user1179317 Oct 05 '20 at 05:02
-1

I do not understand why you are doing sort -mk1 <(cat file); sort can operate on files directly. Look at check_output; that will make your life simpler.

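import subprocess

# text=True decodes the output to str; splitlines() gives one entry per line
output = subprocess.check_output(['ls'], text=True)
for line in output.splitlines():
    print(line)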

You will, of course, have to deal with the exceptions; the subprocess documentation has the details.
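As a rough sketch of that exception handling, assuming the goal is simply to get the sorted lines of one file: the helper name `sorted_lines` and the choice to return None on failure are illustrative, not prescribed.

```python
import subprocess


def sorted_lines(path):
    """Return the sorted lines of a file, or None if sort could not run."""
    try:
        output = subprocess.check_output(
            ['sort', '--', path],
            text=True,
            stderr=subprocess.DEVNULL,  # hide sort's own error message
        )
    except FileNotFoundError:
        # The sort executable itself was not found.
        return None
    except subprocess.CalledProcessError:
        # sort exited with a nonzero status, e.g. the input file is missing.
        return None
    return output.splitlines()
```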

Sedy Vlk
  • I don't think I can use check_output on sort; at least, I tried and it wasn't working either – user1179317 Oct 05 '20 at 04:46
  • Of course you can. I just tried `out = check_output(['sort', '/etc/resolv.conf']).splitlines()` and then printed it; works like a charm – Sedy Vlk Oct 05 '20 at 04:51
  • In your example it would, but I need to use process substitution. Yes, in this problem it doesn't make sense, but I just made it like this to simplify the question. I have to use process substitution, and check_output won't work with that – user1179317 Oct 05 '20 at 05:08
  • Sure it will, under the same circumstances where `Popen` will. You have to use `shell=True` *and* `executable='/bin/bash'`. See my answer for details. The problem with `check_output` is that it will return all the lines in one go after the subprocess exits, instead of letting you read them one at a time as they are generated. – tripleee Oct 05 '20 at 07:48
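The last comment can be sketched like this, assuming `/bin/bash` is available; the helper name `bash_output_lines` is made up for illustration. Note that the whole output is collected before anything is returned.

```python
import subprocess


def bash_output_lines(command):
    """Run a command string under Bash and return all output lines at once."""
    output = subprocess.check_output(
        command,
        shell=True,
        executable='/bin/bash',  # needed for <(...) process substitution
        text=True,
    )
    # check_output only returns after the subprocess has exited.
    return output.splitlines()
```

Something like `bash_output_lines('sort -mk1 <(cat file1.txt) <(cat file2.txt)')` would then return the merged lines as a list, which is fine for small outputs but gives up line-by-line streaming.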