ORC Split Generation issue with Hive Table

Question

I'm using Hive version 3.1.3 on Hadoop 3.3.4 with Tez 0.9.2. When I create an ORC table that contains splits and try to query it, I get an ORC split generation failed exception. If I concatenate the table, this solves the issue in some cases. In others, however, the issue persists.

First I create the table like so, then try to query it:

CREATE TABLE ClaimsOrc STORED AS ORC
AS
SELECT *
FROM ClaimsImport;

SELECT COUNT(*) FROM ClaimsOrc WHERE ClaimID LIKE '%8%';

I then get the following exception:

Vertex failed, vertexName=Map 1, vertexId=vertex_1667735849290_0008_6_00, diagnostics=[Vertex vertex_1667735849290_0008_6_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: claimsorc initializer failed, vertex=vertex_1667735849290_0008_6_00 [Map 1], java.lang.RuntimeException: ORC split generation failed with exception: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:519)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:765)
        at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1790)

However, if I concatenate the table first, which merges the output files into fewer, larger files, the table works fine:

ALTER TABLE ClaimsOrc CONCATENATE;
OK
Time taken: 11.673 seconds

SELECT COUNT(*) FROM ClaimsOrc WHERE ClaimID LIKE '%8%';
OK
1463419
Time taken: 7.446 seconds, Fetched: 1 row(s)

It appears something is going wrong with how the initial CTAS query calculates the splits, and that CONCATENATE fixes it in some cases. In other cases it doesn't, and there's no workaround. How can I fix this?

A few other things worth noting:

Using DESCRIBE EXTENDED ClaimsOrc; shows that ClaimsOrc is an ORC table.
The source table ClaimsImport contains about 24 gzipped, pipe-delimited files.
Before the CONCATENATE, the ClaimsOrc table contains about 24 files.
After the CONCATENATE, the ClaimsOrc table contains only 3 files.
Before the CONCATENATE command, the ORC files appear to be valid. Using the orcfiledump command, I don't see any errors in the few I spot checked.

score 1 · Answer 1 · answered Dec 26 '22 at 14:51

Tez 0.9.2 ships a tez.tar.gz that has to be uploaded to an HDFS location. By default this tarball bundles hadoop-common-2.7.2.jar, which does not have the compareTo(FileStatus) method named in the NoSuchMethodError in your stack trace.
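To confirm which Hadoop jars your Tez tarball actually bundles, you can list its contents. This is only a rough sketch: the function name list_hadoop_jars and the example path are made up for illustration, and the location of your tez.tar.gz will differ.

```shell
# list_hadoop_jars TARBALL
# Print every entry in the tarball whose file name looks like hadoop-*.jar.
# (Hypothetical helper; it only reads the archive and changes nothing.)
list_hadoop_jars() {
  tar -tzf "$1" | grep -E '(^|/)hadoop-[^/]*\.jar$'
}

# Example call (the path is an assumption, adjust to your install):
# list_hadoop_jars /opt/tez/share/tez.tar.gz
```

If the output shows 2.7.x versions while the cluster runs Hadoop 3.3.4, that mismatch is the likely source of the NoSuchMethodError.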

Repackage this tarball with the jars from your own Hadoop version (3.3.4) in place of the bundled Hadoop jars. You may also have to add other jars such as Guava, Woodstox, the stax2 API and more. Then put the repackaged tez.tar.gz on all nodes and in the HDFS location.
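The repackaging could look roughly like the sketch below. Treat it as a starting point under several assumptions, not a tested recipe: repackage_tez is a hypothetical helper, it assumes the dependency jars live under lib/ inside the tarball, it assumes you have already collected the Hadoop 3.3.4 jars you want into one directory, and it leaves jars named hadoop-shim* alone on the assumption that they belong to Tez itself. It does not add Guava, Woodstox or stax2; copy those into the same directory first if you need them.

```shell
# repackage_tez SRC_TARBALL HADOOP_JAR_DIR OUT_TARBALL
# Unpack SRC_TARBALL, drop the bundled hadoop-*.jar files (except hadoop-shim*),
# copy in the hadoop-*.jar files from HADOOP_JAR_DIR, and write OUT_TARBALL.
repackage_tez() {
  rt_src=$1
  rt_jars=$2
  rt_out=$3
  rt_work=$(mktemp -d) || return 1

  tar -xzf "$rt_src" -C "$rt_work" || return 1

  # Remove the old Hadoop jars, keeping anything named hadoop-shim*.
  find "$rt_work" -name 'hadoop-*.jar' ! -name 'hadoop-shim*' -exec rm -f {} + || return 1

  # Add the replacement jars under lib/ (assumed layout).
  mkdir -p "$rt_work/lib" || return 1
  cp "$rt_jars"/hadoop-*.jar "$rt_work/lib/" || return 1

  tar -czf "$rt_out" -C "$rt_work" . || return 1
  rm -rf "$rt_work"
}

# Example (all paths are assumptions):
# repackage_tez tez.tar.gz ./hadoop-3.3.4-jars tez-hadoop3.tar.gz
# hdfs dfs -put -f tez-hadoop3.tar.gz /apps/tez/tez.tar.gz
```

The HDFS destination should be whatever your tez.lib.uris setting points to.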

This error should then go away. As mentioned, you may run into other errors afterwards, which you can resolve by adding further Hadoop dependency jars.

Otherwise, upgrade Tez to a 0.10.x version and check which Hadoop version it is built against. I expect it to be Hadoop 3.x, in which case the upgrade alone should solve the problem.
