Note to community: Please do not close this as a duplicate just because the particular issue I am researching has manifested as a null pointer exception. As you can see from the stack trace, the NPE is buried four layers deep in the Tika library. That means that, of all the great advice given in the existing StackExchange post on NPE, none of it (checking for null pointers) was applied by the Tika developers in four modules. Rather than learn Tika and retrofit its code with a patch to do that work, I think it would be more efficient to ask whether anyone has achieved the common use case of using the SourceCodeParser.

I am looking for help with a published example for the Tika library. I did not author the example code. I have seen many similar questions relating to the Tika library, which has 20 contributors and thousands of lines of code. Please do not close this question, as I believe it can be answered quickly and easily by anyone who has used this parser before. I have already read the post on NullPointerException, and am following this advice from that question:

I still can't find the problem

If you tried to debug the problem and still don't have a solution, you can post a question for more help, but make sure to include what you've tried so far. At a minimum, include the stacktrace in the question, and mark the important line numbers in the code.

As I spent a lot of time authoring this post, retrieving and including the relevant stack trace and source code, I would really appreciate it if you would let it stay open for a little while so that someone who is familiar with Tika can take a look at what appears to be a fairly common issue. As any Java expert knows, many null pointer exception issues are non-trivial, particularly when working with a large, unfamiliar framework. I really appreciate your help.

I wrote a simple program to test the Tika SourceCodeParser by substituting it for the AutoDetectParser in the XHTML parsing example from the Tika Examples page. Executing the parse command on line 137 throws a NullPointerException. It appears that the delegate stored in the `in` field may be missing when line 180 of the Tika source code shown below is reached.

The AutoDetectParser works but does not identify the source code as Java.

When I use the Tika desktop app, it works fine and recognizes the code as Java.

How do I initialize the SourceCodeParser to avoid the NullPointerException when operating it?

Example using Tika "Example" Package LocalFile.toTikaXhtmlString()

123      /** Parses as Tika using source code parser.
124      *
125      * @param filePathParam path to file to parse
126      */
127             public static String toTikaXhtmlString(final String filePathParam)
128                     throws IOException, SAXException, TikaException
129                 {
130                     SourceCodeParser parser = new SourceCodeParser();
131                     ContentHandler handler = new ToXMLContentHandler();
132                     Metadata metadata = new Metadata();
133                     File file = new File(filePathParam);
134                     try (InputStream stream
135                             = ContentHandlerExample.class
136                               .getResourceAsStream(filePathParam)) {
137                         parser.parse(stream, handler, metadata);
138                         return handler.toString();
139                     } catch (Exception e) {
140                         System.out.println("Caught exception.");
141                         System.out.println(e.toString());
142                         e.printStackTrace();
143                         throw e;
144                     }
145                     
146                 }    
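One possibility worth ruling out for the listing above: `Class.getResourceAsStream` resolves names against the classpath, not the file system, and returns null when nothing matches. The following is only a sketch using the JDK; the class name `ResourceCheck`, the helper `openResource`, and the resource path are hypothetical and not part of the Tika example. It fails fast with a descriptive exception instead of handing a null stream to the parser.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class; not part of Tika or of the example package.
class ResourceCheck {

    /** Opens a classpath resource, failing fast instead of returning null. */
    static InputStream openResource(String name) throws FileNotFoundException {
        InputStream stream = ResourceCheck.class.getResourceAsStream(name);
        if (stream == null) {
            // getResourceAsStream signals "not found" with null, not an exception
            throw new FileNotFoundException("Not on the classpath: " + name);
        }
        return stream;
    }

    public static void main(String[] args) {
        // Hypothetical path that is assumed not to exist on the classpath
        try (InputStream in = openResource("/no-such-dir/FileParseTest-run.txt")) {
            System.out.println("Opened resource");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the check fires, the stream was never opened and the NPE would originate in the caller's input rather than inside the parser.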

I also tried avoiding the Tika 'ContentHandlerExample' class by opening the file directly with a FileInputStream, with the same result:

public static String toTikaXhtmlString(final String filePathParam)
        throws IOException, SAXException, TikaException
    {
        SourceCodeParser parser = new SourceCodeParser();
        ContentHandler handler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();
        File file = new File(filePathParam);


        try (InputStream stream = new FileInputStream(file)) {
            parser.parse(stream, handler, metadata);
            return handler.toString();
        } catch (Exception e) {
            throw new RuntimeException(e.getMessage());
        }
    }
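A side note on the catch block above: wrapping with `e.getMessage()` discards the original stack trace, and an NPE created without a detail message contributes only null as the message. The sketch below (the class `WrapDemo` and method `wrap` are hypothetical names) shows one way to keep the original exception attached as the cause, so the Tika frames stay visible in the test output.

```java
// Hypothetical demo class; the names are illustrative only.
class WrapDemo {

    /** Wraps a failure while keeping the original exception as the cause. */
    static RuntimeException wrap(Exception e) {
        return new RuntimeException("Tika parse failed", e);
    }

    public static void main(String[] args) {
        Exception original = new NullPointerException(); // no detail message

        // Message-only wrapping: the cause and its stack trace are dropped
        RuntimeException lossy = new RuntimeException(original.getMessage());
        System.out.println("lossy cause: " + lossy.getCause());

        // Cause-preserving wrapping: the original exception is still reachable
        RuntimeException kept = wrap(original);
        System.out.println("cause kept: " + (kept.getCause() == original));
    }
}
```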

JUnit Test

108         @Test
109         public void parseFile() {
110                 String fileName, verifyInput, resultContent;
111 
112                 //arrange
113                 fileName = "/Users/johnmeyer/Projects/code-proc/FileParseTest-run.txt";
114 
115                 String fileContent = "/** Test */ public MyTestClass {"
116                                    + "public static void main(String[] args) {"
117                                    + "System.out.println(\"This is a test.\"); }";
118 
119 
120                 LocalFile.putText(fileName, fileContent);
121 
122                 verifyInput = LocalFile.getContent(fileName);
123 
124                 assertEquals(fileContent, verifyInput);
125                 //act (and clean up)
126 
127                 try {
128 
129                     resultContent = LocalFile.toTikaXhtmlString(fileName);
130                 } catch (Exception e) {
131                     throw new RuntimeException(e.getMessage());
132                 }
133 
134                 LocalFile.delete(fileName);
135 
136                 //assert
137                 assertEquals(fileContent, resultContent);
138         }

Stack Trace

[INFO] Running us.johnmeyer.test.tools.FileParseTest
Caught exception.
java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.commons.io.input.ProxyInputStream.markSupported(ProxyInputStream.java:181)
    at org.apache.tika.detect.AutoDetectReader.getBuffered(AutoDetectReader.java:137)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:114)
    at org.apache.tika.parser.code.SourceCodeParser.parse(SourceCodeParser.java:93)
    at org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
    at us.johnmeyer.utilities.LocalFile.toTikaXhtmlString(LocalFile.java:137)
    at us.johnmeyer.test.tools.FileParseTest.parseFile(FileParseTest.java:129)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:369)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:275)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:239)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:160)
    at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:373)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:334)
    at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:119)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:407)

Tika Source Code

17 package org.apache.tika.io;
 18 
 19 import java.io.FilterInputStream;
 20 import java.io.IOException;
 21 import java.io.InputStream;
 22 
 23 /**
 24  * A Proxy stream which acts as expected, that is it passes the method
 25  * calls on to the proxied stream and doesn't change which methods are
 26  * being called.
 27  * <p>
 28  * It is an alternative base class to FilterInputStream
 29  * to increase reusability, because FilterInputStream changes the
 30  * methods being called, such as read(byte[]) to read(byte[], int, int).
 31  * <p>
 32  * See the protected methods for ways in which a subclass can easily decorate
 33  * a stream with custom pre-, post- or error processing functionality.
 34  *
 35  * @author Stephen Colebourne
 36  * @version $Id$
 37  */
 38 public abstract class ProxyInputStream extends FilterInputStream {
 40     /**
 41      * Constructs a new ProxyInputStream.
 42      *
 43      * @param proxy  the InputStream to delegate to
 44      */
 45     public ProxyInputStream(InputStream proxy) {
 46         super(proxy);
 47         // the proxy is stored in a protected superclass variable named 'in'
 48     }

...

    174     /**
    175      * Invokes the delegate's <code>markSupported()</code> method.
    176      * @return true if mark is supported, otherwise false
    177      */
    178     @Override
    179     public boolean markSupported() {
    180         return in.markSupported();
    181     }
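The `markSupported()` method above dereferences the protected `in` field inherited from `FilterInputStream`, whose constructor stores whatever it is given without a null check. The sketch below uses only the JDK; `PassThroughStream` is a hypothetical stand-in written for illustration, not the Tika or Commons IO class. It suggests that a wrapper of this shape throws a NullPointerException from `markSupported()` when the stream passed to its constructor is null.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

// Hypothetical demo; only mirrors the shape of the proxy class shown above.
class NullDelegateDemo {

    /** Stand-in proxy stream that forwards markSupported() to its delegate. */
    static class PassThroughStream extends FilterInputStream {
        PassThroughStream(InputStream proxy) {
            super(proxy); // stored in the protected field 'in' without a null check
        }

        @Override
        public boolean markSupported() {
            return in.markSupported(); // NullPointerException when 'in' is null
        }
    }

    /** Returns true if markSupported() throws an NPE for the given delegate. */
    static boolean throwsNpe(InputStream delegate) {
        try {
            new PassThroughStream(delegate).markSupported();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsNpe(null));                                  // prints "true"
        System.out.println(throwsNpe(new ByteArrayInputStream(new byte[0]))); // prints "false"
    }
}
```

If the same mechanism is at work here, the null would have to enter at the point where the stream is created, before `parse()` is called.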
John
  • Oliver, please reopen this as it is a question about how to use an Apache Library, not a NullPointerException within code that I myself have been working on. – John Dec 17 '17 at 18:20
  • Added this code to check inputs to Tika `parse()` method: `if (file == null || handler == null || metadata == null) throw new RuntimeException("Null passed as input.");` No nulls are being passed into it, yet it still yields NPE. – John Dec 17 '17 at 20:08
  • Try using a `TikaInputStream` instead? eg `TikaInputStream.get(new File(fileName))` ? – Gagravarr Dec 18 '17 at 12:29
  • @Gagravarr sounds good, I will try that. – John Dec 18 '17 at 19:27
