I created a custom ParquetOutputFormat (a class in org.apache.parquet.hadoop) to override the getRecordWriter method. Inside getRecordWriter it accesses CodecFactory, which causes an IllegalAccessError. To try to fix this I created my own class loader, but that did not help. I followed this blog post: http://techblog.applift.com/upgrading-spark#advanced-case-parquet-writer

Before I created the custom class loader I was using the CustomParquetOutputFormat as follows:

override def createOutputFormat: OutputFormat[Void, InternalRow] with Ext = new CustomParquetOutputFormat[InternalRow]() with Ext {
 ...
}

The issue happens when CustomParquetOutputFormat tries to access CodecFactory on line 274 when getRecordWriter is called:

  CodecFactory codecFactory = new CodecFactory(conf);

(This is line 274 of ParquetOutputFormat, which CustomParquetOutputFormat accesses.)

CodecFactory is package-private.

Custom Class Loader:

class CustomClassLoader(urls: Array[URL], parent: ClassLoader, whiteList: List[String])
  extends ChildFirstURLClassLoader(urls, parent) {
  // Whitelisted names are resolved through ChildFirstURLClassLoader;
  // all other names go directly to the parent loader.
  override def loadClass(name: String): Class[_] = {
    if (whiteList.exists(name.startsWith)) {
      super.loadClass(name)
    } else {
      parent.loadClass(name)
    }
  }
}

Usage:

val sc: SparkContext = SparkContext.getOrCreate()
val cl: CustomClassLoader = new CustomClassLoader(sc.jars.map(new URL(_)).toArray,
  Thread.currentThread.getContextClassLoader, List(
    "org.apache.parquet.hadoop.CustomParquetOutputFormat",
    "org.apache.parquet.hadoop.CodecFactory",
    "org.apache.parquet.hadoop.ParquetFileWriter",
    "org.apache.parquet.hadoop.ParquetRecordWriter",
    "org.apache.parquet.hadoop.InternalParquetRecordWriter",
    "org.apache.parquet.hadoop.ColumnChunkPageWriteStore",
    "org.apache.parquet.hadoop.MemoryManager"
  ))


cl.loadClass("org.apache.parquet.hadoop.CustomParquetOutputFormat")
  .getConstructor(classOf[String], classOf[TaskAttemptContext])
  .newInstance(fullPathWithoutExt, taskAttemptContext)
  .asInstanceOf[OutputFormat[Void, InternalRow] with ProvidesExtension]

Error:

java.lang.IllegalAccessError: tried to access class org.apache.parquet.hadoop.CodecFactory from class org.apache.parquet.hadoop.customParquetOutputFormat
        at org.apache.parquet.hadoop.CustomParquetOutputFormat.getRecordWriter(CustomParquetOutputFormat.scala:40)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
        at org.apache.spark.custom.hadoop.HadoopWriter.<init>(HadoopWriter.scala:35)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetWriter.<init>(ParquetWriter.scala:16)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetWriterFactory.createWriter(ParquetWriterFactory.scala:71)
        at com.abden.custom.index.IndexBuilder$$anonfun$4.apply(IndexBuilder.scala:55)
        at com.abden.custom.index.IndexBuilder$$anonfun$4.apply(IndexBuilder.scala:54)
        at scala.collection.immutable.Stream.map(Stream.scala:418)
        at com.abden.custom.index.IndexBuilder.generateTiles(IndexBuilder.scala:54)
        at com.abden.custom.index.IndexBuilder.generateLayer(IndexBuilder.scala:155)
        at com.abden.custom.index.IndexBuilder.appendLayer(IndexBuilder.scala:184)
        at com.abden.custom.index.IndexBuilder$$anonfun$appendLayers$1$$anonfun$apply$1.apply(IndexBuilder.scala:213)
        at com.abden.custom.index.IndexBuilder$$anonfun$appendLayers$1$$anonfun$apply$1.apply(IndexBuilder.scala:210)
        at scala.collection.Iterator$class.foreach(Iterator.scala:742)
        at com.abden.custom.util.SplittingByKeyIterator.foreach(SplittingByKeyIterator.scala:3)
        at com.abden.custom.index.IndexBuilder$$anonfun$appendLayers$1.apply(IndexBuilder.scala:210)
        at com.abden.custom.index.IndexBuilder$$anonfun$appendLayers$1.apply(IndexBuilder.scala:209)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

The error happens at this line in getRecordWriter:

val codecFactory = new CodecFactory(conf)

CodecFactory has no access modifier, so it is restricted to its package. Even when I use the dynamic class loader to load all of these classes from the same class loader, I still get the IllegalAccessError.

abden003
  • It’s strange that the error message shows `customParquetOutputFormat` (lower case c) whereas everything else refers to `CustomParquetOutputFormat` (upper case C). Besides that, you should be aware that `super.loadClass(name)` will also check the parent loader first and only try to resolve the class locally, if the parent didn’t find it. Well, and classes loaded by different class loaders are always considered to be in different (runtime) packages, regardless of their name. – Holger Dec 07 '16 at 10:03
  • Sorry, fixed the error message. I changed the names of the classes for this question and accidentally used lower case – abden003 Dec 07 '16 at 18:16
  • Hello, can you share the code you had before the custom class loader, so we can understand the issue you had then? Implementing your own class loader seems to be overkill here ... – loicmathieu Dec 13 '16 at 13:54
  • @loicmathieu I added some context of how I was calling it before – abden003 Dec 13 '16 at 18:16
  • Could you share the output of `mvn clean ; mvn dependency:tree -U > output.txt`? – λ Allquantor λ Dec 17 '16 at 16:14

1 Answer

So what you are trying to do breaks the way Java works! You want to access a package-private class from outside its package by implementing your own class loader that gets around the protection rules of the JVM (so you want to break the Java Language Specification!).

My answer is simple: DON'T DO THIS!

If it's package private, you cannot access it. Period!
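
To make this concrete, here is a minimal sketch (not the Parquet code itself) of the point Holger raised in the comments: at run time a package is identified by its name and by the class loader that defined the class, so two classes with the same package name but different loaders do not share package-private access. The names `RuntimePackageDemo`, `Helper` and `IsolatingLoader` are made up for this illustration, and `Helper` only stands in for a package-private class such as CodecFactory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class RuntimePackageDemo {

    /** Hypothetical stand-in for a package-private class such as CodecFactory. */
    static class Helper {}

    /** Defines the one target class itself; every other name is delegated to the parent. */
    static class IsolatingLoader extends ClassLoader {
        private final String target;

        IsolatingLoader(ClassLoader parent, String target) {
            super(parent);
            this.target = target;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(target)) {
                return super.loadClass(name, resolve); // usual parent-first delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readClassBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length); // this loader becomes the defining loader
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        /** Reads the compiled class file of the target from the parent's class path. */
        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String resource = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(resource)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Package name derived from the binary class name ("" for the unnamed package). */
    static String packageOf(Class<?> c) {
        String n = c.getName();
        int dot = n.lastIndexOf('.');
        return dot < 0 ? "" : n.substring(0, dot);
    }

    /** Same runtime package means same package name AND same defining class loader. */
    static boolean sameRuntimePackage(Class<?> a, Class<?> b) {
        return a.getClassLoader() == b.getClassLoader() && packageOf(a).equals(packageOf(b));
    }

    /** Loads a second copy of Helper through a fresh IsolatingLoader. */
    static Class<?> loadIsolatedHelper() {
        String name = Helper.class.getName();
        try {
            return new IsolatingLoader(RuntimePackageDemo.class.getClassLoader(), name).loadClass(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Class<?> isolated = loadIsolatedHelper();
        System.out.println("same name:            " + isolated.getName().equals(Helper.class.getName()));
        System.out.println("same Class object:    " + (isolated == Helper.class));
        System.out.println("same runtime package: " + sameRuntimePackage(isolated, Helper.class));
    }
}
```

The two copies of `Helper` share a name but are distinct classes in distinct runtime packages. In the question's setup this would presumably mean that CustomParquetOutputFormat and CodecFactory end up being defined by different loaders, so the JVM rejects the package-private access even though both names start with org.apache.parquet.hadoop.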

I think it is best to think in terms of what functionality you need and to implement it with the current API, without trying to force your way in. So instead of asking how to do some technical hack, it would be better to explain what you want to do (why you want to implement your own getRecordWriter method).

I already gave an answer in this SO question about how to read and write Parquet files in plain Java: Write Parquet format to HDFS using Java API with out using Avro and MR

Regards,

Loïc

loicmathieu