I tried this:

for dir in /home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus/2007/02/*/
    for f in *.xml ; do
        echo $f | grep -q '_output\.xml$' && continue # skip output files
        g="$(basename $f .xml)_output.xml"
        java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile $f -outputFormat inlineXML > $g
    done
done

which is based on the answer to this question, but that didn't work.

I have a folder structure such that within the directory NYTimesCorpus there is a directory 2007, and within that a directory 01 and also 02, 03, and so on.

Then within 01 there are again 01, 02, 03, ...

In each of these terminal directories there are many .xml files to which I want to apply the script:

for f in *.xml ; do
    echo $f | grep -q '_output\.xml$' && continue # skip output files
    g="$(basename $f .xml)_output.xml"
    java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile $f -outputFormat inlineXML > $g
done

but there are so many different directories that running it within each directory is a form of rare torture. Apart from 2007 I also have 2006 and 2005, so ideally I would like to run it once and have the program navigate that structure on its own.

My attempts thus far have not been successful. Perhaps one of you knows how to achieve this?

Thank you for your consideration.

UPDATE

textFile=./scrypt.sh
outputFormat=inlineXML
Loading classifier from /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz ... done [2.2 sec].
CRFClassifier tagged 71 words in 5 documents at 959.46 words per second.
CRFClassifier invoked on Sun Apr 12 19:33:34 HKT 2015 with arguments:
   -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile ./scrypt.sh -outputFormat inlineXML
    loadClassifier=/home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz
smatthewenglish

2 Answers

I would use find since it works recursively:

find /path/to/xmls -type f ! -name '*_output.xml' -name '*.xml' -exec ./script.sh {} \;

For better readability I would save the actions that should be executed on each file to script.sh:

#!/bin/bash

f="$1"
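# Strip only the trailing .xml; ${f%%.*} cuts at the first dot, so ./29/1645744.xml would collapse to an empty string
g="${f%.xml}_output.xml"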
java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /home/matthias/Workbench/SUTD/nytimes_corpus/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile "$f" -outputFormat inlineXML > "$g"

and make it executable:

chmod +x script.sh
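
A side note on the output name: it is built with a suffix-removal parameter expansion, and the `%` and `%%` forms behave differently when the path starts with `./`. A minimal sketch, assuming a relative path like the `./29/1645744.xml` reported in the comments (the variable names `short` and `long` are only for illustration):

```shell
# Hypothetical path, shaped like the output of: find . -type f -name '*.xml'
f="./29/1645744.xml"

# %.xml removes only the literal ".xml" at the end of the string
short="${f%.xml}_output.xml"

# %%.* removes the longest suffix that begins with a dot; because the path
# itself begins with ".", that suffix is the entire string
long="${f%%.*}_output.xml"

echo "short: $short"
echo "long:  $long"
```

With a path that starts with `./`, the `%%.*` form leaves nothing of the original name, so every input would be redirected to the same `_output.xml` in the directory the command is run from.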
hek2mgl
  • so in /path/to/xmls do I have to specify each directory? – smatthewenglish Apr 12 '15 at 10:42
  • so `/path/to/xmls` can just be the, relative, root directory, i.e. `/home/matthias/Workbench/SUTD/nytimes_corpus/` is that it? – smatthewenglish Apr 12 '15 at 10:45
  • Yes, in your example this should be the correct path – hek2mgl Apr 12 '15 at 10:46
  • i just tried it but it didn't work. i got the error `Exception in thread main edu.stanford.nlp.io.RuntimeIOException: java.io.FileNotFoundException: *.xml (No such file or directory) ` – smatthewenglish Apr 12 '15 at 10:50
  • What is the output of `find /path/to/xmls -type f -name '*.xml'` ? – hek2mgl Apr 12 '15 at 10:55
  • i did `find . -type f -name '*.xml'` which was also how I ran the original command you recommended, with a `.` and that one I just did, that you recommended here in the comments, it displayed all the .xml files correctly, in `./29/1645744.xml ` that way for all files – smatthewenglish Apr 12 '15 at 10:57
  • The script by @hek2mgl needs an edit. Replace the assignment to `g` by: `g="$(dirname $f)/$(basename $f .xml)_output.xml"` – Abhay Apr 12 '15 at 11:04
  • it seems not to have worked, the program terminated and there were no output files in the respective folders, maybe i have to specify the `dirname` for the correct number of directories until the `.xml` files? – smatthewenglish Apr 12 '15 at 11:15
  • It should be `g="${f%%.*}_output.xml"`. Edited that – hek2mgl Apr 12 '15 at 11:21
  • I think it's not working, since I don't see the output files anywhere, I posted what is being written to the console under the **UPDATE** section of my original question, it's also writing an empty file called `_output.xml` to where ever I call it from – smatthewenglish Apr 12 '15 at 11:37
  • @hek2mgl no! that's not what's happening at all! what's happening is a file called `_output.xml` seems to be being written after it reads an entire directory and then it goes and reads a new directory, erases that file and starts over. – smatthewenglish Apr 12 '15 at 11:42
  • The command above won't do that. Are you using the command *exactly* as I posted it? – hek2mgl Apr 12 '15 at 11:56
  • I did exactly as you had posted. it seems that it was doing that, such as there was a file called `_output.xml` and the time stamp on it kept changing and getting more recent, it would have a big size and then the size would be zero – smatthewenglish Apr 12 '15 at 12:00
  • It should be `f=$1` .. Changed that. Please try it again – hek2mgl Apr 12 '15 at 12:04
  • i tried also that and the behaviour is still as I have described before – smatthewenglish Apr 12 '15 at 12:11
  • I accepted the other answer because it seems to work, but I want to tell you that I am very grateful for your help. It's also clear that glenn jackman cited your use of `find` as a good solution, so I upvoted your answer, and I appreciate your efforts. Thank you. – smatthewenglish Apr 12 '15 at 13:24
  • Thanks and you are welcome. I'm just wondering why it doesn't work. It should! I've even created a test setup and everything was working fine. Sure, I cannot emulate that java process, I just did `cp "$f" "$g"` and it worked as expected. The java process shouldn't be the problem. I cannot look at your terminal, but I would debug it step by step. Is `$f` ok? Is `$g` ok? Is every file listed properly? Then you should get it working. – hek2mgl Apr 12 '15 at 13:33

find is a good solution. It sounds like all the xml files are at the same directory depth, so try this:

dir=/home/matthias/Workbench/SUTD/nytimes_corpus
for f in $dir/NYTimesCorpus/*/*/*/*.xml; do
    [[ $f == *_output.xml ]] && continue # skip output files
    g="${f%.xml}_output.xml"
    java -mx600m \
         -cp $dir/NER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar \
         edu.stanford.nlp.ie.crf.CRFClassifier \
         -loadClassifier $dir/NER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz \
         -textFile "$f" \
         -outputFormat inlineXML > "$g"
done

The glob pattern $dir/NYTimesCorpus/*/*/*/*.xml specifies that the wanted xml files sit in directories exactly 3 levels below NYTimesCorpus. If that is the wrong depth, alter the number of */ in the pattern.
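
To see what the depth of the pattern means in practice, here is a small dry run. It is only a sketch: it assumes a throwaway directory tree with made-up file names, prints the planned output names instead of calling java, and uses a POSIX `case` test in place of `[[ ... ]]`.

```shell
# Build a throwaway tree shaped like NYTimesCorpus/<year>/<month>/<day>/
root=$(mktemp -d)
mkdir -p "$root/NYTimesCorpus/2007/01/01" "$root/NYTimesCorpus/2007/02/29"
touch "$root/NYTimesCorpus/2007/01/01/1.xml" \
      "$root/NYTimesCorpus/2007/01/01/1_output.xml" \
      "$root/NYTimesCorpus/2007/02/29/1645744.xml" \
      "$root/NYTimesCorpus/2007/stray.xml"   # only one level deep: not matched

matched=""
for f in "$root"/NYTimesCorpus/*/*/*/*.xml; do
    case $f in *_output.xml) continue ;; esac   # skip output files
    rel=${f#"$root"/NYTimesCorpus/}             # path relative to NYTimesCorpus
    echo "$rel -> ${rel%.xml}_output.xml"       # planned output name, nothing is run
    matched="$matched $rel"
done
rm -rf "$root"
```

Only the files three directories down are visited; `stray.xml` is ignored because it sits at a different depth. Adding or removing one `*/` segment shifts the matched depth by one level.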

If the xml files can appear at varying depths, use find, or in bash use:

shopt -s globstar nullglob
for f in $dir/NYTimesCorpus/**/*.xml; do

reference
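
For the varying-depth case, the find route could look roughly like the following sketch. It assumes a throwaway tree with made-up file names and only prints the output names it would use; the inline `sh -c` loop stands in for the java call and is not taken from either answer.

```shell
# Throwaway tree with xml files at different depths (hypothetical names)
root=$(mktemp -d)
mkdir -p "$root/2005/12" "$root/2007/02/29"
touch "$root/2005/12/a.xml" \
      "$root/2007/02/29/1645744.xml" \
      "$root/2007/02/29/1645744_output.xml"

# find descends to any depth and skips files already named *_output.xml;
# the inline sh loop receives the matched paths as its positional parameters
planned=$(find "$root" -type f -name '*.xml' ! -name '*_output.xml' \
    -exec sh -c 'for f; do echo "${f%.xml}_output.xml"; done' sh {} + | sort)

echo "$planned"
rm -rf "$root"
```

Replacing the `echo` inside the loop with the real command, redirected to `"${f%.xml}_output.xml"`, would process each file in place without a separate script.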

glenn jackman
  • ok, first let me say: bravo. absolutely top-ranked. fantastic. hats off to you sir for this phenomenal solution. and then to follow that up: what is the mechanism by which this works where the others did not? how could this be changed to accommodate a slightly different depth? – smatthewenglish Apr 12 '15 at 11:59