
I'm trying to upgrade from Spark 2.1 to 2.2. When I try to read or write a DataFrame to a location (CSV or JSON), I receive this error:

Illegal pattern component: XXX
java.lang.IllegalArgumentException: Illegal pattern component: XXX
at org.apache.commons.lang3.time.FastDatePrinter.parsePattern(FastDatePrinter.java:282)
at org.apache.commons.lang3.time.FastDatePrinter.init(FastDatePrinter.java:149)
at org.apache.commons.lang3.time.FastDatePrinter.<init>(FastDatePrinter.java:142)
at org.apache.commons.lang3.time.FastDateFormat.<init>(FastDateFormat.java:384)
at org.apache.commons.lang3.time.FastDateFormat.<init>(FastDateFormat.java:369)
at org.apache.commons.lang3.time.FastDateFormat$1.createInstance(FastDateFormat.java:91)
at org.apache.commons.lang3.time.FastDateFormat$1.createInstance(FastDateFormat.java:88)
at org.apache.commons.lang3.time.FormatCache.getInstance(FormatCache.java:82)
at org.apache.commons.lang3.time.FastDateFormat.getInstance(FastDateFormat.java:165)
at org.apache.spark.sql.catalyst.json.JSONOptions.<init>(JSONOptions.scala:81)
at org.apache.spark.sql.catalyst.json.JSONOptions.<init>(JSONOptions.scala:43)
at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:333)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:279)

I am not setting a value for dateFormat anywhere, so I don't understand where this pattern is coming from.

spark.createDataFrame(objects.map((o) => MyObject(t.source, t.table, o.partition, o.offset, d)))
    .coalesce(1)
    .write
    .mode(SaveMode.Append)
    .partitionBy("source", "table")
    .json(path)

I still get the error with this:

import org.apache.spark.sql.{SaveMode, SparkSession}
val spark = SparkSession.builder.appName("Spark2.2Test").master("local").getOrCreate()
import spark.implicits._
// Person is not defined in the original snippet; inferred from the schema below
case class Person(name: String, age: Long)

val agesRows = List(Person("alice", 35), Person("bob", 10), Person("jill", 24))
val df = spark.createDataFrame(agesRows).toDF()

df.printSchema
df.show

df.write.mode(SaveMode.Overwrite).csv("my.csv")

Here is the schema:

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = false)
  • I don't see anything wrong with your code. Can you please share the MyObject class definition? Try converting the object to JSON manually, then saving it as a string. – Rahul Sharma Sep 26 '17 at 15:05
  • case class MyObject(source: String, table: String, partition: Int, offset: Long, updatedOn: String) – Lee Sep 26 '17 at 17:17
  • Read and write date fields as String, and operate on the date field manually using SimpleDateFormat. – Rahul Sharma Sep 26 '17 at 18:31

4 Answers


I found the answer.

The default for timestampFormat is yyyy-MM-dd'T'HH:mm:ss.SSSXXX, and the XXX component is the "illegal pattern component" from the exception. The option needs to be set explicitly when you write the DataFrame out.

The fix is to change the time-zone part to ZZ, which still includes the time zone:

df.write
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .mode(SaveMode.Overwrite)
  .csv("my.csv")
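For context, the XXX token is not inherently illegal: the JDK's own java.text.SimpleDateFormat has accepted it (as an ISO-8601 zone offset) since Java 7. A quick Spark-independent sanity check of the default pattern, sketched below; the exception comes from FastDateFormat, which (as other answers here note) only handles this pattern in commons-lang3 3.5:

```scala
import java.text.SimpleDateFormat
import java.util.Date

// The default Spark pattern, verified against the JDK's SimpleDateFormat.
// It parses fine here; the "Illegal pattern component: XXX" error is specific
// to FastDateFormat from a pre-3.5 commons-lang3 on the classpath.
val jdkFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
println(jdkFormat.format(new Date)) // e.g. 2018-03-14T18:14:00.000+01:00
```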
  • Also if you're trying to read a file: `df = spark.read.option('timestampFormat', 'yyyy/MM/dd HH:mm:ss ZZ').json(PATH_TO_FILE)` – William Luxion Jan 22 '18 at 05:58
  • Correct, this only happens for CSV and JSON. – Lee Mar 14 '18 at 18:14
  • Oddly ... I have no timestamps in my output. There is a timestamp column in the prior-to-filtering stages, but still this option was required to avoid the stackdump. – codeaperature Oct 25 '18 at 20:47

Ensure you are using the correct version of commons-lang3:

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-lang3</artifactId>
  <version>3.5</version>
</dependency>
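For SBT users, the same pin can be expressed as a build.sbt line (a sketch of the equivalent dependency, not a complete build file):

```scala
// build.sbt fragment: pin commons-lang3 to 3.5 so it wins over older copies
libraryDependencies += "org.apache.commons" % "commons-lang3" % "3.5"
```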
  • Why does commons-lang3 have anything to do with this? – Haha TTpro Apr 18 '18 at 03:09
  • I'm also interested in an explanation – Romibuzi Sep 18 '18 at 12:24
  • In CDH, hive-exec-1.1.0-cdh5.15.1.jar also contains the class "FastDateFormat", which does not support the default format "yyyy-MM-dd'T'HH:mm:ss.SSSXXX" of org.apache.spark.sql.catalyst.json.JSONOptions. So ensure the commons-lang3 3.5 jar is in your classpath. In SBT, add the dependency with the compile option: "org.apache.commons" % "commons-lang3" % "3.5" % "compile" – Nagaraj Vittal Apr 29 '19 at 09:18

Using commons-lang3-3.5.jar fixed the original error. I didn't check the source code to tell why, but it is not surprising, since the original exception happens at org.apache.commons.lang3.time.FastDatePrinter.parsePattern(FastDatePrinter.java:282). I also noticed the file /usr/lib/spark/jars/commons-lang3-3.5.jar (on an EMR cluster instance), which also suggests 3.5 is the consistent version to use.
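To confirm which jar actually supplies FastDateFormat at runtime, a small diagnostic can be pasted into spark-shell (jarFor is a hypothetical helper written here, not a library function):

```scala
// Resolve a class by name and report the jar (code source) it was loaded from.
// Returns None if the class is missing or has no code source (e.g. JDK classes).
def jarFor(className: String): Option[String] =
  try {
    val cls = Class.forName(className)
    Option(cls.getProtectionDomain.getCodeSource).map(_.getLocation.toString)
  } catch { case _: ClassNotFoundException => None }

println(jarFor("org.apache.commons.lang3.time.FastDateFormat")
  .getOrElse("commons-lang3 not on the classpath"))
```

If the printed path points at a Hive or other bundled jar rather than commons-lang3-3.5.jar, that shadowing is the likely cause of the error.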


I also hit this problem. In my case the cause was that I had put a badly formatted JSON file on HDFS. After I put a correctly formatted text/JSON file there, it worked.

  • There were no timestamps in the file, just epochs, which are longs. Thanks for the comment. – Lee Oct 28 '19 at 14:15