Note to community: Please do not close this as duplicate because the particular issue I am researching has manifested as a null pointer exception. As you can see from the stack trace, the NPE is buried 4 layers deep in the Tika library. That means of all the great advice that was given in the existing StackExchange post on NPE, none of the Tika developers saw fit to apply that advice (checking for null pointers) in four modules. Rather than learn Tika and retrofit their code with a patch to do that work, it think it would be more efficient to ask if anyone had achieved the common use case of using the SourcCodeParser.
I am looking for help with a published example for the Tika library here. I did not author the example code. I have seen many similar questions relating to the Tika library, which has 20 contributors and thousands of lines of code. Please do not close this question as I believe this can be quickly easily answered by anyone who used this Parser before. I have already read the post on NullPointerException, and am following this advice from that question:
I still can't find the problem
If you tried to debug the problem and still don't have a solution, you can post a question for more help, but make sure to include what you've tried so far. At a minimum, include the stacktrace in the question, and mark the important line numbers in the code.
As I spent much time authoring this post, retrieving and including relevant stack trace and source code, I would really appreciate it if you would allow this to spend a little bit of time in an unclosed state so that someone who is familiar with Tika might take a look at what appears to be fairly common issue. As you would know as a Java expert, many null pointer exception issues can be non-trivial, particularly when working with a large unfamiliar framework. I really appreciate your help.
I wrote a simple program to test the Tika SourceCodeParser
, by substituting it for the AutoDetectParser
in the XHTML parsing example from the Tika Examples page. When executing the parse command on line 137, there is a NullPointerException. It appears that there may be a delegate missing from the in
on line 180 of the Parser code.
The AutoDetectParser
works but does not identify the source code as java.
When I use the Tika desktop app, it works fine and recognizes the code as Java.
How do I initialize the SourceCodeParser
to avoid the NullPointerException
when operating it?
Example using Tika "Example" Package LocalFile.toTikaXhtmlString()
123 /** Parses as Tika using source code parser.
124 *
125 * @param filePathParam path to file to parse
126 */
127 public static String toTikaXhtmlString(final String filePathParam)
128 throws IOException, SAXException, TikaException
129 {
130 SourceCodeParser parser = new SourceCodeParser();
131 ContentHandler handler = new ToXMLContentHandler();
132 Metadata metadata = new Metadata();
133 File file = new File(filePathParam);
134 try (InputStream stream
135 = ContentHandlerExample.class
136 .getResourceAsStream(filePathParam)) {
137 parser.parse(stream, handler, metadata);
138 return handler.toString();
139 } catch (Exception e) {
140 System.out.println("Caught exception.");
141 System.out.println(e.toString());
142 e.printStackTrace();
143 throw e;
144 }
145
146 }
I also tried avoiding the Tika 'ContentHandlerExample' class using direct call with InputStreamReader, to the same result:
public static String toTikaXhtmlString(final String filePathParam)
throws IOException, SAXException, TikaException
{
SourceCodeParser parser = new SourceCodeParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
File file = new File(filePathParam);
try (InputStream stream = new FileInputStream(file)) {
parser.parse(stream, handler, metadata);
return handler.toString();
} catch (Exception e) {
throw new RuntimeException(e.getMessage());
}
}
JUNIT Test
108 @Test
109 public void parseFile() {
110 String fileName, verifyInput, resultContent;
111
112 //arrange
113 fileName = "/Users/johnmeyer/Projects/code-proc/FileParseTest-run.txt";
114
115 String fileContent = "/** Test */ public MyTestClass {"
116 + "public static void main(String[] args) {"
117 + "System.out.println(\"This is a test.\"); }";
118
119
120 LocalFile.putText(fileName, fileContent);
121
122 verifyInput = LocalFile.getContent(fileName);
123
124 assertEquals(fileContent, verifyInput);
125 //act (and clean up)
126
127 try {
128
129 resultContent = LocalFile.toTikaXhtmlString(fileName);
130 } catch (Exception e) {
131 throw new RuntimeException(e.getMessage());
132 }
133
134 LocalFile.delete(fileName);
135
136 //assert
137 assertEquals(fileContent, resultContent);
138 }
Stack Trace
[INFO] Running us.johnmeyer.test.tools.FileParseTest Caught exception. java.lang.NullPointerException java.lang.NullPointerException at org.apache.commons.io.input.ProxyInputStream.markSupported(ProxyInputStream.java:181) at org.apache.tika.detect.AutoDetectReader.getBuffered(AutoDetectReader.java:137) at org.apache.tika.detect.AutoDetectReader.(AutoDetectReader.java:114) at org.apache.tika.parser.code.SourceCodeParser.parse(SourceCodeParser.java:93) at org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53) at us.johnmeyer.utilities.LocalFile.toTikaXhtmlString(LocalFile.java:137) at us.johnmeyer.test.tools.FileParseTest.parseFile(FileParseTest.java:129) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:369) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:275) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:239) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:160) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:373) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:334) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:119) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:407)
Tika Source Code
17 package org.apache.tika.io;
18
19 import java.io.FilterInputStream;
20 import java.io.IOException;
21 import java.io.InputStream;
22
23 /**
24 * A Proxy stream which acts as expected, that is it passes the method
25 * calls on to the proxied stream and doesn't change which methods are
26 * being called.
27 * <p>
28 * It is an alternative base class to FilterInputStream
29 * to increase reusability, because FilterInputStream changes the
30 * methods being called, such as read(byte[]) to read(byte[], int, int).
31 * <p>
32 * See the protected methods for ways in which a subclass can easily decorate
33 * a stream with custom pre-, post- or error processing functionality.
34 *
35 * @author Stephen Colebourne
36 * @version $Id$
37 */
38 public abstract class ProxyInputStream extends FilterInputStream {
40 /**
41 * Constructs a new ProxyInputStream.
42 *
43 * @param proxy the InputStream to delegate to
44 */
45 public ProxyInputStream(InputStream proxy) {
46 super(proxy);
47 // the proxy is stored in a protected superclass variable named 'in'
48 }
...
174 /**
175 * Invokes the delegate's <code>markSupported()</code> method.
176 * @return true if mark is supported, otherwise false
177 */
178 @Override
179 public boolean markSupported() {
180 return in.markSupported();
181 }