
I wrote some simple code to parse a large XML file (extract the lines, clean the text, and remove any HTML tags) using Apache Spark.

I'm seeing a NullPointerException when calling `.replaceAllIn` on a string that is definitely not null.

The funny thing is that I get no errors when I run the code locally, using input from disk, but I get a NullPointerException when I run the same code on AWS EMR, loading the input file from S3.

Here is the relevant code:

import java.util.Locale
import org.apache.commons.lang3.StringEscapeUtils // provides unescapeXml (assuming Commons Lang 3)

val HTML_TAGS_PATTERN = """<[^>]+>""".r

// other code here... (a SparkSession named `spark` is in scope;
// import spark.implicits._ is needed for the .toDS() call below)

spark
.sparkContext
.textFile(pathToInputFile, numPartitions)
.filter { str => str.startsWith("  <row ") }
.toDS()
.map { str =>

  Locale.setDefault(new Locale("en", "US"))

  val parts = str.split(""""""") // split on the double-quote character

  var title: String = ""
  var body: String = ""

  // some code omitted here

  title = StringEscapeUtils.unescapeXml(title).toLowerCase.trim
  body = StringEscapeUtils.unescapeXml(body).toLowerCase // decode XML entities

  println("before replacing, body is: " + body)

  // NEXT LINE TRIGGERS NPE
  body = HTML_TAGS_PATTERN.replaceAllIn(body, " ") // strip HTML tags

}

Things I've tried:

  • printing the string just before calling replaceAllIn to make sure it's not null.

  • making sure the Locale is not null

  • printing out the exception message and stack trace: they just tell me that this is the line where the NullPointerException occurs, nothing more

Things that are different between my local setup and AWS EMR:

  • in my local setup, I load the input file from disk; on EMR, I load it from S3.

  • in my local setup, I run Spark in standalone mode; on EMR, it runs in cluster mode.


Everything else is the same on my machine and on AWS EMR: Scala version, Spark version, Java version, cluster configs...

I have been trying to figure this out for some hours and I can't think of anything else to try.

EDIT

I've moved the call to `.r` to within the `map{}` body, like this:

val HTML_TAGS_PATTERN = """<[^>]+>"""

// code omitted

.map { str =>

   body = HTML_TAGS_PATTERN.r.replaceAllIn(body, " ")

 }

This also produces an NPE, with the following stack trace:

java.lang.NullPointerException
    at java.util.regex.Pattern.<init>(Pattern.java:1350)
    at java.util.regex.Pattern.compile(Pattern.java:1028)
    at scala.util.matching.Regex.<init>(Regex.scala:191)
    at scala.collection.immutable.StringLike$class.r(StringLike.scala:255)
    at scala.collection.immutable.StringOps.r(StringOps.scala:29)
    at scala.collection.immutable.StringLike$class.r(StringLike.scala:244)
    at scala.collection.immutable.StringOps.r(StringOps.scala:29)
    at ReadSOStanfordTokenize$$anonfun$2.apply(ReadSOStanfordTokenize.scala:102)
    at ReadSOStanfordTokenize$$anonfun$2.apply(ReadSOStanfordTokenize.scala:72)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spar
  • Have you tried putting the regex inline? Is it the regex that is null, not the string body? It might be that the regex isn't getting distributed correctly to the workers. Can you include some of the stack trace? – Stephen Jun 04 '17 at 08:24
  • @Stephen I've tried just calling `.r()` inside the `map{}` and now this is what the stacktrace looks like: [gist link here](https://gist.github.com/queirozfcom/0aece2c4912017f0941a78a03de97fe9) (it does look like it's something to do with the regex) – Felipe Jun 04 '17 at 20:34
  • @Stephen I now put the entire String declaration into the worker code and it looks like it's a win....=).... Write it out as an answer, so I can tick you green. =) – Felipe Jun 04 '17 at 21:16
  • I think the solution here is to stop using regex for HTML/XML documents – OneCricketeer Jun 04 '17 at 23:38
  • @cricket_007 I did try using spark-xml to parse my file. But I kept running into OOM errors, even with 16GB of RAM given to Spark. Whereas reading the file as a text file and doing some minor parsing turned out to be very fast. I even asked a question on SO about that: https://stackoverflow.com/questions/43796443/out-of-memory-error-when-reading-large-file-in-spark-2-1-0 – Felipe Jun 05 '17 at 00:40
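For reference, a spark-xml read along the lines of what that last comment describes would look roughly like this (a sketch; the rowTag value is an assumption based on the <row> elements in the question):

// com.databricks:spark-xml parses the document into structured rows;
// this is the approach that ran out of memory in the asker's case,
// which is why the question falls back to plain textFile parsing.
val rows = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "row")
  .load(pathToInputFile)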

3 Answers


I think you should try putting the regex inline, like below.

This is a bit of a lame solution; you should be able to define a constant, maybe put it in a global object or something. I'm not sure where you are defining it such that it becomes a problem. But remember that Spark serialises the code and runs it on distributed workers, so something could be going wrong with that.

rdd.map { _ =>
   ...

   body = """<[^>]+>""".r.replaceAllIn(body, " ")    

 }

I get a very similar error when I run `.r` on a null String:

val x: String = null 
x.r 
java.lang.NullPointerException
  java.util.regex.Pattern.<init>(Pattern.java:1350)
  java.util.regex.Pattern.compile(Pattern.java:1028)
  scala.util.matching.Regex.<init>(Regex.scala:223)
  scala.collection.immutable.StringLike.r(StringLike.scala:281)
  scala.collection.immutable.StringLike.r$(StringLike.scala:281)
  scala.collection.immutable.StringOps.r(StringOps.scala:29)
  scala.collection.immutable.StringLike.r(StringLike.scala:270)
  scala.collection.immutable.StringLike.r$(StringLike.scala:270)
  scala.collection.immutable.StringOps.r(StringOps.scala:29)

That error has slightly different line numbers, I think because of the Scala version. I'm on 2.12.2.
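A related idiom that might help if you want to keep the pattern as a shared field (a sketch on my part, not verified on the asker's setup) is to mark it @transient lazy val, so the compiled Regex is never serialized with the closure and each executor recompiles it on first use:

class BodyCleaner extends Serializable {
  // @transient: the field is skipped during serialization;
  // lazy: it is rebuilt on first access inside each executor JVM,
  // so no null field ever arrives from the driver.
  @transient lazy val HTML_TAGS_PATTERN = """<[^>]+>""".r

  def clean(body: String): String =
    HTML_TAGS_PATTERN.replaceAllIn(body, " ")
}

The class name BodyCleaner is just illustrative.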

  • By the way, do you have any idea as to why a simple string can't be passed on to the map blocks? This seems pretty basic. – Felipe Jun 06 '17 at 04:34
  • It's not because it's a String or a regex. It's something to do with where that string lives, e.g. in an object or an anonymous class nested in something. I'm not sure; maybe look into how Spark serializes code. – Stephen Jun 06 '17 at 06:44

Thanks to Stephen's answer, I found out why I was getting an NPE in my UDF. I went this way (finding a match, in my case):

def findMatch(word: String): String => Boolean = { s =>
  // Wrap the possibly-null input in Option so a null never reaches the regex
  Option(s) match {
    case Some(validText) => word.toLowerCase.r.findAllIn(validText.toLowerCase).nonEmpty
    case None            => false
  }
}
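For context, here is a sketch of how this can be wired up as a Spark SQL UDF (the DataFrame df and the column name body are assumptions for illustration):

import org.apache.spark.sql.functions.{col, udf}

// findMatch("spark") is a String => Boolean that returns false for
// null column values instead of throwing an NPE inside the UDF.
val containsWord = udf(findMatch("spark"))

val matches = df.filter(containsWord(col("body")))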

"<[^>]+>" was great, but I have one type of things in my HTML. it consists of a name of style and then parameters in between curly braces:

p { margin-top: 0px;margin-bottom: 0px;line-height: 1.15; }
body { font-family: 'Arial';font-style: Normal;font-weight: normal;font-size: 14.6666666666667px; }.Normal { telerik-style-type: paragraph;telerik-style-name: Normal;border-collapse: collapse; }.TableNormal { telerik-style-type: table;telerik-style-name: TableNormal;border-collapse: collapse; }.s_4C87DD5E { telerik-style-type: local;font-family: 'Arial';font-size: 14.6666666666667px;color: #000000; }.s_8D20FCAB { telerik-style-type: local;font-family: 'Arial';font-size: 14.6666666666667px;color: #000000;text-decoration: underline; }.p_53E06EE5 { telerik-style-type: local;margin-left: 0px; } 

I tried to extract them using the following, but it didn't work:

"\{[^\}]+\}"