1

I have the following Java command line working fine on Mac OS.

java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file.txt > output.txt

Multiple files can be passed as input with spaces as follows.

java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file1.txt file2.txt > output.txt

Now I have 100 files in a folder, and I have to pass all of them as input to this command. I used Python's os.system in a for loop over the directory, as follows.

for i, f in enumerate(os.listdir(filedir)):
    os.system('java -cp "stanford-ner.jar" edu.stanford.nlp.process.PTBTokenizer "%s" > "annotate_%s.txt"' % (f, i))

This works only for the first file. For all the other outputs, such as annotate_1 and annotate_2, it creates the file but with nothing inside it. I thought of looping over the files and passing them to subprocess.Popen(), but that did not seem promising.

Now I am thinking of passing the files one by one in a loop from a bash script, so the command runs sequentially on each file. I am also wondering whether I can run at least 10 files in parallel, in different terminals, at a time. Any solution is fine, but I think this question will help me gain some insight into the different approaches.

user3368375
  • 1
    First, you really shouldn't be using `os.system`. You've added the `subprocess` tag, which is about using the `subprocess` module. Use that; then you can pass multiple arguments just by passing a list with multiple elements, without having to worry about how to quote them or join them up or anything. – abarnert Dec 18 '14 at 10:40
  • 1
    Also, unlike `os.system`, `subprocess` lets you start a process up and then check whether it's finished later, instead of waiting for each one to finish before you can do the next. – abarnert Dec 18 '14 at 10:41
  • @abarnert I didn't get the last part, but I will look at the first part right away. – user3368375 Dec 18 '14 at 10:47
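
A minimal sketch of what the second comment describes, assuming Python 3: `Popen` returns as soon as the child has started, and `poll()` reports whether it has exited yet. The children here are stand-ins (the Python interpreter itself, sleeping briefly), not the tokenizer, and `start_children` and `wait_all` are hypothetical helper names.

```python
import subprocess
import sys
import time

def start_children(commands):
    """Start every command without waiting; return the Popen objects."""
    return [subprocess.Popen(cmd) for cmd in commands]

def wait_all(procs, interval=0.05):
    """Poll until every child has exited; return exit codes in start order."""
    pending = list(procs)
    while pending:
        # poll() returns None while the child is still running.
        pending = [p for p in pending if p.poll() is None]
        if pending:
            time.sleep(interval)
    return [p.returncode for p in procs]

if __name__ == '__main__':
    # Stand-in children: three short-lived Python processes running at once.
    child = [sys.executable, '-c', 'import time; time.sleep(0.2)']
    procs = start_children([child] * 3)
    print(wait_all(procs))
```

With `os.system`, the three children would run one after another, because each call blocks until its command finishes.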

4 Answers

1

If you want to do this from the shell instead of Python, the xargs tool can almost do everything you want.

You give it a command with a fixed list of arguments, and feed it input with a bunch of filenames, and it'll run the command multiple times, using the same fixed list plus a different batch of filenames from its input. The --max-args option (short form -n) sets the size of the biggest group. If you want to run things in parallel, the --max-procs option (short form -P) lets you do that.

But that's not quite there, because it doesn't do the output redirection. But… do you really need 10 separate files instead of 1 big one? Because if 1 big one is OK, you can just redirect all of them to it:

ls | xargs --max-args=10 --max-procs=10 java -cp stanford-ner.jar\
    edu.stanford.nlp.process.PTBTokenizer >> output.txt
abarnert
  • Hey man, it worked. This was very useful. Thanks a lot. – user3368375 Dec 18 '14 at 11:26
  • upvoted for xargs. Are you sure that `>> output.txt` and `> output.txt` differ in this case (assuming there is no output.txt before the `xargs` is run)? Also the output might be garbled if `max-procs > 1`. – jfs Dec 18 '14 at 14:08
0

Inside your input file directory you can do the following in bash:

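#!/bin/bash
# Collect the files in an array so each name stays a single argument,
# even if it contains spaces.
inputs=()
for file in *.txt
do
    inputs+=("$file")
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer "${inputs[@]}" > output.txt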

If you want to run it as a script, save the file under some name, for example my_exec.bash:

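#!/bin/bash
# Expect exactly two arguments: the input directory and the output file.
if [ $# -ne 2 ]; then
    echo "Invalid input. Enter a directory and an output file"
    exit 1
fi
if [ ! -d "$1" ]; then
    echo "Please pass a valid directory"
    exit 1
fi
# Collect the files in an array so each name stays a single argument.
inputs=()
for file in "$1"/*.txt
do
    inputs+=("$file")
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer "${inputs[@]}" > "$2"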

Make it an executable file

chmod +x my_exec.bash

USAGE:

 ./my_exec.bash <folder> <output_file>
pratZ
  • Hey. Where should I specify my directory path? Should it be in quotes? – user3368375 Dec 18 '14 at 10:56
  • This is a really complicated way of doing the same thing as just passing `*.txt` on the command line (but less robustly—e.g., if there are quotes or other special characters in any of the filenames, this will break, but `*.txt` will not). – abarnert Dec 18 '14 at 11:00
  • a file name will not have special characters. As far as spaces are concerned the double quotes will handle that. – pratZ Dec 18 '14 at 11:05
  • @pratZ: But you don't _need_ any of that. In every case this whole mess works, `*.txt` works too. And in many cases where this mess _doesn't_ work, `*.txt` works. So why do this? – abarnert Dec 18 '14 at 11:44
  • @pratZ: Also, how do you know "a file name will not have special characters". Filenames are allowed to have quotes in them on most platforms. On Linux, they're even allowed to have control characters and bytes that aren't part of a valid character in the relevant encoding. – abarnert Dec 18 '14 at 11:45
0

If you have 100 files, and you want to kick off 10 processes, each handling 10 files, all in parallel, that's easy.

First, you want to group them into chunks of 10. You can do this with slicing or with zipping iterators; in this case, since we definitely have a list, let's just use slicing:

files = os.listdir(filedir)
groups = [files[i:i+10] for i in range(0, len(files), 10)]
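
For the iterator-based alternative, here is a minimal sketch. It uses `itertools.islice` rather than the zip idiom, since that keeps a short final group without padding; `chunked` is a hypothetical helper name, not part of the standard library.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        # Take the next `size` items; an empty list means we are done.
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Unlike slicing, this also works on inputs that are not lists, such as a generator of filenames.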

Now, you want to kick off a process for each group, and then wait for all of the processes, instead of waiting for each one to finish before kicking off the next. This is impossible with os.system, which is one of the many reasons the os.system documentation says "The subprocess module provides more powerful facilities for spawning new processes…"

procs = [subprocess.Popen(…) for group in groups]
for proc in procs:
    proc.wait()

So, what do you pass on the command line to give it 10 filenames instead of 1? If none of the names have spaces or other special characters, you can just ' '.join them. But otherwise, it's a nightmare. Another reason subprocess is better: you can just pass a list of arguments:

procs = [subprocess.Popen(['java', '-cp', 'stanford-ner.jar',
                           'edu.stanford.nlp.process.PTBTokenizer'] + group)
         for group in groups]
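
To illustrate why joining names into one shell string is fragile, here is a small sketch assuming Python 3.3+ (for `shlex.quote`). The filenames are made up, `naive_command` and `quoted_command` are hypothetical helpers, and `shlex.split` stands in for the way a POSIX shell would split the string.

```python
import shlex

def naive_command(prefix, names):
    """Join everything with spaces, as one would for os.system."""
    return prefix + ' ' + ' '.join(names)

def quoted_command(prefix, names):
    """Quote each name for a POSIX shell before joining."""
    return prefix + ' ' + ' '.join(shlex.quote(n) for n in names)

if __name__ == '__main__':
    names = ['plain.txt', 'two words.txt']
    # The naive string splits 'two words.txt' into two separate arguments.
    print(shlex.split(naive_command('cat', names)))
    # The quoted string keeps each filename as one argument.
    print(shlex.split(quoted_command('cat', names)))
```

Passing a list to `Popen` sidesteps the issue entirely, because no shell ever parses the names.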

But now how do you get all of the results?

One way is to go back to using a shell command line with the > redirection. But a better way is to do it in Python:

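procs = []
files = []
for i, group in enumerate(groups):
    file = open('output_{}'.format(i), 'w')
    files.append(file)
    # Same argument list as before: the fixed command plus this group's filenames.
    procs.append(subprocess.Popen(['java', '-cp', 'stanford-ner.jar',
                                   'edu.stanford.nlp.process.PTBTokenizer'] + group,
                                  stdout=file))
for proc in procs:
    proc.wait()
for file in files:
    file.close()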

(You might want to use a with statement with ExitStack, but I wanted to make sure this didn't require Python 2.7/3.3+, so I used explicit close.)
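
For reference, a sketch of the `ExitStack` variant, assuming Python 3.3+. `run_groups` and its `output_pattern` parameter are hypothetical names introduced for illustration.

```python
import subprocess
import sys
from contextlib import ExitStack

def run_groups(base_cmd, groups, output_pattern='output_{}'):
    """Run base_cmd + group for each group in parallel; return exit codes."""
    with ExitStack() as stack:
        procs = []
        for i, group in enumerate(groups):
            # Every file registered here is closed when the with block exits,
            # even if starting a later process raises.
            out = stack.enter_context(open(output_pattern.format(i), 'w'))
            procs.append(subprocess.Popen(base_cmd + group, stdout=out))
        # Wait before leaving the with block, mirroring the explicit version.
        return [proc.wait() for proc in procs]
```

It would be called with the same argument list as before, for example `run_groups(['java', '-cp', 'stanford-ner.jar', 'edu.stanford.nlp.process.PTBTokenizer'], groups)`.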

abarnert
0

To pass all .txt files in the current directory at once to the java subprocess:

#!/usr/bin/env python
from glob import glob
from subprocess import check_call

cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
with open('output.txt', 'wb', 0) as file:
    check_call(cmd + glob('*.txt'), stdout=file)

It is similar to running the shell command but without running the shell:

$ java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer *.txt > output.txt

To run no more than 10 subprocesses at a time, passing no more than 100 files at a time, you could use multiprocessing.pool.ThreadPool:

#!/usr/bin/env python
from glob import glob
from multiprocessing.pool import ThreadPool
from subprocess import call
try:
    from threading import get_ident # Python 3.3+
except ImportError: # Python 2
    from thread import get_ident

cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
def run_command(files):
    with open('output%d.txt' % get_ident(), 'ab', 0) as file:
        return files, call(cmd + files, stdout=file)

all_files = glob('*.txt')
file_groups = (all_files[i:i+100] for i in range(0, len(all_files), 100))
for _ in ThreadPool(10).imap_unordered(run_command, file_groups):
   pass

It is similar to this xargs command (suggested by @abarnert):

$ ls *.txt | xargs --max-procs=10 --max-args=100 java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer >>output.txt

except that each thread in the Python script writes to its own output file to avoid corrupting the output due to parallel writes.
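
The loop above discards the exit statuses that run_command returns. If you want to know which groups failed, a hedged sketch along the same lines follows; `run_batches` is a hypothetical name, and output redirection is left out to keep it short.

```python
import sys
from multiprocessing.pool import ThreadPool
from subprocess import call

def run_batches(cmd, file_groups, workers=10):
    """Run cmd + group for each group, at most `workers` at a time.

    Returns the groups whose subprocess exited with a non-zero status.
    """
    def run_one(files):
        # call() blocks this worker thread until the subprocess exits.
        return files, call(cmd + files)

    pool = ThreadPool(workers)
    try:
        results = list(pool.imap_unordered(run_one, file_groups))
    finally:
        pool.close()
        pool.join()
    return [files for files, status in results if status != 0]
```

Any groups it returns can be retried or logged once the pool has finished.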

jfs