I am using Hadoop 3.1.4, Hive 3.1.3, and Tez 0.9.2. I have an ORC table and I am trying to get its row count with `select count(*) from ORC_TABLE`, but the query throws the following exceptions:

Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1670915386694_0182_1_00, diagnostics=[Vertex vertex_1670915386694_0182_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: jio_ar_consumer_events initializer failed, vertex=vertex_1670915386694_0182_1_00 [Map 1], java.lang.RuntimeException: ORC split generation failed with exception: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:519)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:765)
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)

Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1790)
    ... 17 more
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
    at org.apache.hadoop.hive.ql.io.AcidUtils.lambda$getAcidState$0(AcidUtils.java:1117)
    at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
    at java.util.TimSort.sort(TimSort.java:220)
    at java.util.Arrays.sort(Arrays.java:1512)
    at java.util.ArrayList.sort(ArrayList.java:1464)
    at java.util.Collections.sort(Collections.java:177)
    at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1115)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.callInternal(OrcInputFormat.java:1207)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.access$1500(OrcInputFormat.java:1142)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:1179)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:1176)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:1176)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:1142)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    ... 3 more


Another question, "ORC Split Generation issue with Hive Table", describes the same problem, but it has no solution yet. I also tried running CONCATENATE on the ORC table, but that didn't help either.

What works though is, if I run select * from ORC_TABLE with or without LIMIT, it seems to extract the records. I reckon the issue only affects aggregate functions, or maybe I don't understand the issue yet.

I am also using Spark 3.3.1, and I can get the same count through Spark SQL and fetch the rows as well. There are no issues on the Spark side.

In addition, when I change the execution engine to MR, the query works. It fails only on the Tez engine.

Any leads to resolve this issue are much appreciated.

Afroz Baig
  • `What works though is, if I run select * from ORC_TABLE with or without LIMIT, it seems to extract the records.` I think this issue only triggers when Hive/Tez has to run a job. Simple queries like `SELECT *` don't require a job; Hive/Tez can just read rows from the file, so there is no issue. When you use an aggregate function or a filter (`WHERE SomeValue LIKE '%a%'`), a job must run on the cluster, which triggers the issue. – Patrick Tucci Dec 25 '22 at 12:02
  • I asked the question you linked to. I never received any help and could not resolve the issue. I moved over to Spark. The support for data warehousing/ETL workloads is worse, but it was much easier to set up, and unlike Hive/Tez, it actually works. I hope you can find a solution. – Patrick Tucci Dec 25 '22 at 12:03
  • Yes. For now, we are falling back to MR execution to get the count and other aggregate functions. I will update here if I find a solution. – Afroz Baig Dec 25 '22 at 12:52
  • This looks like a problem reading the Hadoop library jar files. Does that mean Tez is unable to read them? Is something in Tez's own library files conflicting with the Hadoop lib files, for example a conflicting jar that lacks this method and takes precedence? These are the angles I am looking at. – Afroz Baig Dec 25 '22 at 13:16
  • @AfrozBaig your analysis is right. The error below means there are multiple jars/classes with different versions of org.apache.hadoop.fs.FileStatus.compareTo: `java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I`. You should try adding the verbose option to the working and non-working commands and compare which jar is loading org.apache.hadoop.fs.FileStatus.compareTo. – Raid Dec 25 '22 at 16:37
  • Ref : https://stackoverflow.com/questions/10230279/java-verbose-class-loading – Raid Dec 25 '22 at 16:39
  • Thanks @Raid. Though I could not enable verbose, as I am not a Java geek, I drilled down looking for dependencies. `public int compareTo(Object o) { FileStatus other = (FileStatus) o; return compareTo(other); }` https://github.com/apache/hadoop/commit/f4e42a728b7db69c4fa1c3f7d2e42eea110107b7 These are the lines of code that fix this issue, and they are available from Hadoop 2.8.2. Tez 0.9.2 came by default with Hadoop 2.7.2 dependencies, which are missing this. I have posted the solution in the answer section below. – Afroz Baig Dec 26 '22 at 12:31

1 Answer

The issue was resolved with the steps below, based on my earlier analysis:

The class org.apache.hadoop.fs.FileStatus is part of the hadoop-common jar.
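If enabling verbose class loading is not practical, a rough way to see whether a particular hadoop-common jar is likely to offer the method from the error is to look inside the jar directly. The sketch below is only a heuristic and an assumption on my part: it searches the raw bytes of FileStatus.class for the JVM descriptor from the stack trace rather than parsing the class file properly. The function name `jar_mentions_descriptor` is hypothetical.

```python
import zipfile

# JVM descriptor of the method named in the NoSuchMethodError above.
DESCRIPTOR = b"(Lorg/apache/hadoop/fs/FileStatus;)I"
CLASS_ENTRY = "org/apache/hadoop/fs/FileStatus.class"


def jar_mentions_descriptor(jar_path, class_entry=CLASS_ENTRY, descriptor=DESCRIPTOR):
    """Return True/False if the class is in the jar, or None if it is absent.

    Heuristic only: it looks for the descriptor bytes anywhere in the class
    file instead of parsing the constant pool, so treat the answer as a hint.
    """
    with zipfile.ZipFile(jar_path) as jar:
        if class_entry not in jar.namelist():
            return None
        return descriptor in jar.read(class_entry)
```

Calling `jar_mentions_descriptor("hadoop-common-2.7.2.jar")` on the jar that Tez actually loads and comparing it with the cluster's own hadoop-common jar gives a quick hint about which one lacks the method.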

We were using Hadoop 3.1.4 and Tez 0.9.2.

Tez 0.9.2 ships a tez.tar.gz that needs to be placed on HDFS. This tez.tar.gz contained hadoop-common-2.7.2.jar, which does not have the compareTo method named in the exception.
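To see which Hadoop jars a given tez.tar.gz bundles, something like the following minimal sketch can help. It assumes the bundled jars follow the usual `hadoop-<artifact>-<version>.jar` naming; the function name `bundled_hadoop_jars` is hypothetical, and any jar from another project whose name happens to start with `hadoop-` would be listed too.

```python
import re
import tarfile

# Matches names such as "hadoop-common-2.7.2.jar" -> ("hadoop-common", "2.7.2").
HADOOP_JAR = re.compile(r"^(hadoop-[a-z0-9-]+?)-(\d+(?:\.\d+)+)\.jar$")


def bundled_hadoop_jars(tarball_path):
    """Return sorted (artifact, version) pairs for hadoop-* jars in a tarball."""
    found = set()
    with tarfile.open(tarball_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Ignore the directory part; only the file name carries the version.
            name = member.name.rsplit("/", 1)[-1]
            match = HADOOP_JAR.match(name)
            if match:
                found.add(match.groups())
    return sorted(found)
```

For example, `for artifact, version in bundled_hadoop_jars("tez.tar.gz"): print(artifact, version)` prints one line per bundled jar, which makes a version mismatch with the cluster easy to spot.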

Solution 1:

We extracted tez.tar.gz and replaced all the Hadoop 2.7.2 jars with Hadoop 3.1.4 jars. Do this if you don't want to reconfigure with a new Tez version; otherwise, follow Solution 2 below.

We then recreated the tarball and placed it in all dependent locations, including HDFS. For us that was /user/tez/share/tez.tar.gz; the location will vary with your setup.

The error disappeared after I followed these steps, and I can now count records on any table.
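The extract, replace, and repack steps could be scripted roughly as follows. This is a sketch under assumptions, not the exact procedure we ran: it assumes the replacement jars keep the same artifact names and sit somewhere under a local directory (`new_jar_dir`), and every name in it (`repack_with_new_hadoop`, its parameters) is hypothetical. Keep `out_tarball` outside `work_dir` so the archive does not include itself.

```python
import re
import shutil
import tarfile
from pathlib import Path

# "hadoop-common-2.7.2.jar" -> artifact "hadoop-common", version "2.7.2".
OLD_JAR = re.compile(r"^(hadoop-[a-z0-9-]+?)-(\d+(?:\.\d+)+)\.jar$")


def repack_with_new_hadoop(src_tarball, new_jar_dir, work_dir, out_tarball, new_version):
    """Swap bundled hadoop-* jars for same-named artifacts of new_version.

    Returns the artifacts that had no replacement under new_jar_dir; their
    old jars are left in place so nothing silently disappears.
    """
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    with tarfile.open(src_tarball, "r:gz") as tar:
        tar.extractall(work)  # only do this with a tarball you trust

    missing = []
    for old in sorted(work.rglob("hadoop-*.jar")):
        match = OLD_JAR.match(old.name)
        if not match or match.group(2) == new_version:
            continue
        artifact = match.group(1)
        candidates = list(Path(new_jar_dir).rglob(f"{artifact}-{new_version}.jar"))
        if not candidates:
            missing.append(artifact)
            continue
        # Put the new jar next to the old one, then drop the old one.
        shutil.copy2(candidates[0], old.with_name(candidates[0].name))
        old.unlink()

    # Recreate the archive with the same top-level layout as the original.
    with tarfile.open(out_tarball, "w:gz") as tar:
        for entry in sorted(work.iterdir()):
            tar.add(entry, arcname=entry.name)
    return missing
```

The resulting tarball still has to be uploaded separately, for example with `hdfs dfs -put -f tez.tar.gz /user/tez/share/tez.tar.gz` if that is where your Tez configuration points. Review any artifacts reported as missing by hand before uploading.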

Solution 2: Alternatively, use a Tez 0.10.x release, which contains libraries for Hadoop 3.x, rather than Tez 0.9.2, which is compatible with Hadoop 2.7.x.

Afroz Baig