
I'm using xsbt-proguard-plugin, which is an SBT plugin for working with Proguard.

I'm trying to come up with a Proguard configuration for a Hive Deserializer I've written, which has the following dependencies:

// project/Dependencies.scala
val hadoop      = "org.apache.hadoop"          %  "hadoop-core"          % V.hadoop
val hive        = "org.apache.hive"            %  "hive-common"          % V.hive
val serde       = "org.apache.hive"            %  "hive-serde"           % V.hive
val httpClient  = "org.apache.httpcomponents"  %  "httpclient"           % V.http 
val logging     = "commons-logging"            %  "commons-logging"      % V.logging
val specs2      = "org.specs2"                 %% "specs2"               % V.specs2      % "test"

Plus an unmanaged dependency:

// lib/UserAgentUtils-1.6.jar

Because most of these are either for local unit testing or are available within a Hadoop/Hive environment anyway, I want my minified jarfile to only include:

  • The Java classes SnowPlowEventDeserializer.class and SnowPlowEventStruct.class
  • org.apache.httpcomponents.httpclient
  • commons-logging
  • lib/UserAgentUtils-1.6.jar

But I'm really struggling to get the syntax right. Should I start from a whitelist of classes I want to keep, or explicitly filter out the Hadoop/Hive/Serde/Specs2 libraries? I'm aware of this SO question but it doesn't seem to apply here.

If I initially try the whitelist approach:

// Should be equivalent to sbt> package
import ProguardPlugin._
lazy val proguard = proguardSettings ++ Seq(
  proguardLibraryJars := Nil,
  proguardOptions := Seq(
    "-keepattributes *Annotation*,EnclosingMethod",
    "-dontskipnonpubliclibraryclassmembers",
    "-dontoptimize",
    "-dontshrink",
    "-keep class com.snowplowanalytics.snowplow.hadoop.hive.SnowPlowEventDeserializer",
    "-keep class com.snowplowanalytics.snowplow.hadoop.hive.SnowPlowEventStruct"
  )
)

Then I get a Hadoop processing error, so clearly Proguard is still trying to bundle Hadoop:

proguard: java.lang.IllegalArgumentException: Can't find common super class of [[Lorg/apache/hadoop/fs/FileStatus;] and [[Lorg/apache/hadoop/fs/s3/Block;]

Meanwhile if I try Proguard's filtering syntax to build up the blacklist of libraries I don't want to include:

import ProguardPlugin._
lazy val proguard = proguardSettings ++ Seq(
  proguardLibraryJars := Nil,
  proguardOptions := Seq(
    "-keepattributes *Annotation*,EnclosingMethod",
    "-dontskipnonpubliclibraryclassmembers",
    "-dontoptimize",
    "-dontshrink",
    "-injars  !*hadoop*.jar"
  )
)

Then this doesn't seem to work either:

proguard: java.io.IOException: Can't read [/home/dev/snowplow-log-deserializers/!*hadoop*.jar] (No such file or directory)

Any help greatly appreciated!

Alex Dean

2 Answers


The whitelist is the proper approach: ProGuard should get a complete context, so it can properly shake out classes, fields, and methods that are not needed.

The error "Can't find common super class" suggests that some library is still missing from the input. ProGuard probably warned about it, but the configuration appears to contain the option -ignorewarnings or -dontwarn (which should be avoided). You should add the library with -injars or -libraryjars.

If ProGuard then includes some classes that you weren't expecting in the output, you can get an explanation with "-whyareyoukeeping class somepackage.SomeUnexpectedClass".

Starting from a working configuration, you can still try to filter out classes or entire jars from the input. Note that filters are attached to entries in a class path, not specified on their own, e.g. "-injars some.jar(!somepackage/**.class)" -- cf. the manual. This can be useful if the input contains test classes that drag in other unwanted classes.
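Putting that advice back into the question's plugin settings might look like the sketch below. This is a hypothetical adaptation, not a verified configuration: the hadoop-core jar path is an illustrative placeholder, and the filtered -injars line just restates the manual's example syntax.

    // Hypothetical sketch of the suggested fix, using the
    // xsbt-proguard-plugin keys from the question.
    import ProguardPlugin._
    lazy val proguard = proguardSettings ++ Seq(
      // Give ProGuard the Hadoop/Hive jars as *library* jars, so it can
      // resolve their class hierarchies without bundling them in the output.
      // Path is a placeholder -- point it at the real managed jar location.
      proguardLibraryJars ++= Seq(file("lib_managed/jars/hadoop-core.jar")),
      proguardOptions := Seq(
        "-keepattributes *Annotation*,EnclosingMethod",
        "-dontskipnonpubliclibraryclassmembers",
        "-dontoptimize",
        "-dontshrink",
        "-keep class com.snowplowanalytics.snowplow.hadoop.hive.SnowPlowEventDeserializer",
        "-keep class com.snowplowanalytics.snowplow.hadoop.hive.SnowPlowEventStruct"
        // A filter, if needed, attaches to a class-path entry, e.g.:
        // "-injars some.jar(!somepackage/**.class)"
      )
    )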

Eric Lafortune
  • Thanks Eric, I appreciate the help. In the end though, I couldn't get beyond duplicate class errors, let alone to excluding the specific jars. It also seemed a bit clumsy to exclude specific jars when I have a simple list of dependencies in SBT that I would rather work with/annotate for inclusion/exclusion. In the end I went with the sbt-assembly approach, see below. – Alex Dean May 31 '12 at 09:43

In the end, I couldn't get past duplicate class errors using Proguard, let alone figure out how to filter out the relevant jars, so I finally switched to a much cleaner sbt-assembly approach:

1. Added the sbt-assembly plugin to my project as per the README

2. Updated the appropriate project dependencies with a "provided" flag to stop them being added into my fat jar:

val hadoop      = "org.apache.hadoop"          %  "hadoop-core"          % V.hadoop      % "provided"
val hive        = "org.apache.hive"            %  "hive-common"          % V.hive        % "provided"
val serde       = "org.apache.hive"            %  "hive-serde"           % V.hive        % "provided"
val httpClient  = "org.apache.httpcomponents"  %  "httpclient"           % V.http
val httpCore    = "org.apache.httpcomponents"  %  "httpcore"             % V.http  
val logging     = "commons-logging"            %  "commons-logging"      % V.logging     % "provided"
val specs2      = "org.specs2"                 %% "specs2"               % V.specs2      % "test"

3. Added an sbt-assembly configuration like so:

import sbtassembly.Plugin._
import AssemblyKeys._
lazy val sbtAssemblySettings = assemblySettings ++ Seq(
  assembleArtifact in packageScala := false,
  jarName in assembly <<= (name, version) { (name, version) => name + "-" + version + ".jar" },
  mergeStrategy in assembly <<= (mergeStrategy in assembly) {
    (old) => {
      case "META-INF/NOTICE.txt" => MergeStrategy.discard
      case "META-INF/LICENSE.txt" => MergeStrategy.discard
      case x => old(x)
    }
  }
)
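The mergeStrategy setting above is just a pattern match from an archive path to a strategy. As a stand-alone illustration of that dispatch, here is a minimal sketch in which Discard and Default are hypothetical stand-ins for sbt-assembly's MergeStrategy values, so the routing logic can run outside the build:

```scala
// Stand-alone sketch of the merge-strategy dispatch above.
// Discard/Default are hypothetical markers, not sbt-assembly's
// real MergeStrategy type.
object MergeRouting {
  sealed trait Strategy
  case object Discard extends Strategy
  case object Default extends Strategy

  // Duplicate NOTICE/LICENSE files are dropped; every other path
  // falls through to the plugin's default handling.
  def route(path: String): Strategy = path match {
    case "META-INF/NOTICE.txt"  => Discard
    case "META-INF/LICENSE.txt" => Discard
    case _                      => Default
  }
}
```

The key point is the catch-all case delegating to the old strategy: only the two named files are special-cased, everything else keeps sbt-assembly's default behaviour.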

Then running assembly produced a "fat jar" with just the packages I needed in it, including the unmanaged dependency and excluding Hadoop/Hive etc.

Alex Dean