I have trained a custom NER model with Stanford-NER. I created a properties file and used the -serverProperties argument with the java command to start my server and load my custom NER model (directions I followed from another question of mine, seen here), but when the server attempts to load my custom model it fails with this error: java.io.EOFException: Unexpected end of ZLIB input stream

The stderr.log output with the error is as follows:

[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called --- 
[main] INFO CoreNLP - setting default constituency parser 
[main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz 
[main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead 
[main] INFO CoreNLP - to use shift reduce parser download English models jar from: 
[main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html 
[main] INFO CoreNLP -     Threads: 4 
[main] INFO CoreNLP - Liveness server started at /0.0.0.0:9000 
[main] INFO CoreNLP - Starting server... 
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0.0.0.0:80 
[pool-1-thread-3] INFO CoreNLP - [/127.0.0.1:35546] API call w/annotators tokenize,ssplit,pos,lemma,depparse,natlog,ner,openie 
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize 
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer. 
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit 
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos 
[pool-1-thread-3] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec]. 
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma 
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse 
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ... 
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99996, Elapsed Time: 12.297 (s) 
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [13.6 sec]. 
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator natlog 
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner 
java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)     
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)     
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)   
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)  
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2620)
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2636)     
    at java.io.ObjectInputStream$BlockDataInputStream.readDoubles(ObjectInputStream.java:3333)  
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1920) 
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) 
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) 
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422) 
    at edu.stanford.nlp.ie.crf.CRFClassifier.loadClassifier(CRFClassifier.java:2650) 
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1462) 
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1494)
    at edu.stanford.nlp.ie.crf.CRFClassifier.getClassifier(CRFClassifier.java:2963)     
    at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifierFromPath(ClassifierCombiner.java:282)   
    at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifiers(ClassifierCombiner.java:266)  
    at edu.stanford.nlp.ie.ClassifierCombiner.<init>(ClassifierCombiner.java:141)   
    at edu.stanford.nlp.ie.NERClassifierCombiner.<init>(NERClassifierCombiner.java:128)     
    at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:121)    
    at edu.stanford.nlp.pipeline.AnnotatorFactories$6.create(AnnotatorFactories.java:273)   
    at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:152)  
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:451)    
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:154)   
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:145)   
    at edu.stanford.nlp.pipeline.StanfordCoreNLPServer.mkStanfordCoreNLP(StanfordCoreNLPServer.java:273)    
    at edu.stanford.nlp.pipeline.StanfordCoreNLPServer.access$500(StanfordCoreNLPServer.java:50)    
    at edu.stanford.nlp.pipeline.StanfordCoreNLPServer$CoreNLPHandler.handle(StanfordCoreNLPServer.java:583)    
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)     
    at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)   
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)     
    at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:675)   
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)     
    at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:647)  
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)  
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)  
    at java.lang.Thread.run(Thread.java:748) 

I have googled this error and most of what I found concerns an issue with Java from 2007-2010 where an EOFException is "arbitrarily" thrown. This information is from here.

"When using gzip (via new Deflater(Deflater.BEST_COMPRESSION, true)), for some files, and EOFException is thrown at the end of inflating. Although the file is correct, the bug is the EOFException is thrown inconsistently. For some files it is thrown, other it is not."

Answers to other people's questions regarding this error state that you have to close the output streams for the gzip. I'm not entirely sure what that means, and I don't know how I would act on that advice since Stanford-NER is the software creating the gzip file for me.

Question: What actions can I take to eliminate this error? I am hoping this has happened to others in the past. I am also looking for feedback from @StanfordNLPHelp as to whether similar issues have arisen in the past and whether anything has been or is being done to the CoreNLP software to eliminate this issue. If there is a fix within CoreNLP, what files do I need to change, where are those files located within the CoreNLP framework, and what changes do I need to make?

ADDED INFO (PER @StanfordNLPHelp comments):

My model was trained using the directions found here. To train the model I used a TSV file, as outlined in the directions, containing text from around 90 documents. I know this is not a substantial amount of data to train with, but we are just in the testing phase and will improve the model as we acquire more data.

With this TSV file and the Stanford-NER software I ran the command below.

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop

This built my model, and I was even able to load it and successfully tag a larger corpus of text with the NER GUI that comes with the Stanford-NER software.
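
For completeness, the same sanity check can also be run from the command line instead of the GUI. Something along these lines should work (sample.txt is just a placeholder for any plain text file to tag, and the classifier path is whatever the serializeTo property produced):

    java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier custom_model.ser.gz -textFile sample.txt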

While troubleshooting why I was unable to get the model to work, I also tried updating my server.properties file with the file path to the "3 class model" that comes standard with CoreNLP. Again it failed with the same error.

The fact that both my custom model and the 3 class model work in the Stanford-NER software but fail to load in the server makes me believe my custom model is not the issue, and that there is some problem with how the CoreNLP software loads these models through the -serverProperties argument. Or it could be something I am completely unaware of.

The properties file I used to train my NER model was similar to the one in the directions, with only the training file and the output file name changed. It looks like this:

# location of the training file
trainFile = custom-model-trainingfile.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = custome-ner-model.ser.gz

# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1

# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1

# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only 
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC

My server.properties file contained only one line: ner.model = /path/to/custom_model.ser.gz

I also added /path/to/custom_model to the $CLASSPATH variable in the startup script, changing the line CLASSPATH="$CLASSPATH:$JAR" to CLASSPATH="$CLASSPATH:$JAR:/path/to/custom_model.ser.gz". I am not sure whether this step is necessary because I hit the ZLIB error first; I just wanted to include it for completeness.

I also attempted to gunzip my custom model with the command gunzip custom_model.ser.gz and got an error similar to the one I get when trying to load the model: gzip: custom_model.ser.gz: unexpected end of file
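
(In case it helps anyone else debugging the same thing: gzip has a test mode that checks the archive's integrity without writing a decompressed copy, and comparing a checksum of the file on the machine where the model was trained against the copy on the machine where it is loaded shows whether it was damaged in transfer. File name as above; use an equivalent checksum tool on Windows.)

    # check gzip integrity without decompressing
    gzip -t custom_model.ser.gz
    # compare this value on the training machine and on the server machine
    md5sum custom_model.ser.gz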

  • @ChristopherManning You obviously know quite a bit about CoreNLP and I have seen that you tend to answer Error related questions. Have you seen this before? – Fraizier Reiland May 16 '17 at 17:45
  • Have you ever actually successfully run your trained model? Could you provide some details about how you trained your new ner model...for instance the command and properties file used? If you are getting an error like this it makes me think something is wrong with the trained model file itself. – StanfordNLPHelp May 17 '17 at 06:23
  • Also have you tried gunzip'ing the file at the command line? I don't think the file has to be gzipped to work. So you could try loading the non-gzipped version. – StanfordNLPHelp May 17 '17 at 06:26
  • @StanfordNLPHelp I did not try to 'gunzip' my file. I did not want to stray from the directions. I will give that a shot. I added more information per your request. Please see edited question. Thank you. – Fraizier Reiland May 17 '17 at 12:12
  • Most likely a corrupt model file. I have seen this happen when there was not enough disk space left to store the entire model during the training process. Whatever the reason, trying to gunzip the model is always a good first step to check if the model corruption is the culprit. – demongolem Oct 14 '17 at 01:00

1 Answer

I'm assuming you downloaded Stanford CoreNLP 3.7.0 and have a folder somewhere called stanford-corenlp-full-2016-10-31. For the sake of this example let's assume it's in /Users/stanfordnlphelp/stanford-corenlp-full-2016-10-31 (change this to match your setup).

Also, just to clarify: when you run a Java program, it looks in the CLASSPATH for compiled code and resources. A common way to set the CLASSPATH is to set the CLASSPATH environment variable with the export command.

Typically Java compiled code and resources are stored in jar files.

If you look at stanford-corenlp-full-2016-10-31 you'll see a bunch of .jar files. One of them is called stanford-corenlp-3.7.0-models.jar. You can look at what's inside a jar file with this command: jar tf stanford-corenlp-3.7.0-models.jar.

You'll notice when you look inside that file that there are (among others) various ner models. For instance you should see this file:

edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz

in the models jar.

So a reasonable way for us to get things working is to run the server and tell it to only load 1 model (since by default it will load 3).

  1. Run these commands in one window (in the same directory as the file ner-server.properties):

    export CLASSPATH=/Users/stanfordnlphelp/stanford-corenlp-full-2016-10-31/*:
    
    java -Xmx12g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 -serverProperties ner-server.properties
    

with ner-server.properties being a file containing these 2 lines:

annotators = tokenize,ssplit,pos,lemma,ner
ner.model = edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz

The export command above is putting EVERY jar in that directory on the CLASSPATH. That is what the * means. So stanford-corenlp-3.7.0-models.jar should be on the CLASSPATH. Thus when the Java code runs, it will be able to find edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz.
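
If you want to double-check that the models jar really does contain that file before starting the server, you can list the jar's contents and filter for it (a quick sketch, run from the stanford-corenlp-full-2016-10-31 directory):

    jar tf stanford-corenlp-3.7.0-models.jar | grep english.all.3class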

  2. In a different terminal window, issue this command:

    wget --post-data 'Joe Smith lives in Hawaii.' 'localhost:9000/?properties={"outputFormat":"json"}' -O -
    

When this runs, you should see in the first window (where the server is running) that only this model is loading: edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz.

You should note that if you deleted the ner.model line from your properties file and redid all of these steps, 3 models would load instead of 1.

Please let me know if that all works or not.

Let's assume I made an NER model called custom_model.ser.gz, and that file is what StanfordCoreNLP output after the training process. Let's say I put it in the folder /Users/stanfordnlphelp/.

If steps 1 and 2 worked, you should be able to alter ner-server.properties to this:

annotators = tokenize,ssplit,pos,lemma,ner
ner.model = /Users/stanfordnlphelp/custom_model.ser.gz

And when you do the same thing, it should show your custom model loading, and there should not be any kind of gzip issue. If you are still having a gzip issue, please let me know what kind of system you are running this on (Mac OS X, Unix, Windows, etc.).

And to confirm: you said that you have run your custom NER model with the standalone Stanford NER software, right? If so, that sounds like the model file is fine.
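
One more way to separate a model problem from a server problem, as a sketch: try loading the model with CRFClassifier directly on the machine where the server runs, using the same CoreNLP jars (paths follow the example above; sample.txt is any plain text file):

    java -cp "/Users/stanfordnlphelp/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /Users/stanfordnlphelp/custom_model.ser.gz -textFile sample.txt

If this hits the same ZLIB/EOF error, the file on that machine is truncated or corrupt; if it tags the text fine, the problem is in how the server is being started.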

StanfordNLPHelp
  • I was able to successfully load my custom model. I found out through another question on a different stack site that there are issues when you create a gzip file on one OS (in my case Windows) and try to utilize that gzip on another OS (in my case Linux). I did not get the error when I loaded my model on my windows system. **Biggest takeaway is to create the model on the same OS you plan to load it on.** Seems like common sense but now we know. Thanks again. – Fraizier Reiland May 18 '17 at 12:48