
Does Spark v2.3.1 depend on the local time zone when reading from a JSON file?

My src/test/resources/data/tmp.json:

[
  {
    "timestamp": "1970-01-01 00:00:00.000"
  }
]

and Spark code:

SparkSession.builder()
    .appName("test")
    .master("local")
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
    .read()
    .option("multiLine", true).option("mode", "PERMISSIVE")
    .schema(new StructType()
        .add(new StructField("timestamp", DataTypes.TimestampType, true, Metadata.empty())))
    .json("src/test/resources/data/tmp.json")
    .show();

Result:

+-------------------+
|          timestamp|
+-------------------+
|1969-12-31 22:00:00|
+-------------------+

How can I make Spark return 1970-01-01 00:00:00.000?

P.S. This question is not a duplicate of Spark Structured Streaming automatically converts timestamp to local time, because the solution provided there does not work for me, and it is already included in my question (see .config("spark.sql.session.timeZone", "UTC")).

  • Could you confirm that `org.apache.spark.sql.RuntimeConfig` actually reflects the configuration? Does the other solution fix the problem? – zero323 Nov 11 '18 at 18:57
  • Actually, never mind. Setting `user.timezone` does solve the problem. That said, this qualifies for a JIRA ticket, if there isn't one already. The behavior is not only confusing, but also inconsistent between different data sources. – zero323 Nov 11 '18 at 19:31
  • 1
    @user6910411 solved with `TimeZone.setDefault(TimeZone.getTimeZone("UTC"))` – VB_ Nov 11 '18 at 21:13
  • 1
    That however, wont' help you in a distributed mode. – zero323 Nov 12 '18 at 11:14
  • @user6910411 yeah, good point – VB_ Nov 12 '18 at 11:51
  • I think [this](https://stackoverflow.com/a/48767250/6910411) is still the best approach here. And since `spark.sql.session.timeZone` affects only SQL code, it might be a good idea to set it anyway, to ensure consistent results. – zero323 Nov 12 '18 at 11:53
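The comments converge on the root cause: in Spark 2.x, the JSON datasource parses timestamps using the JVM default time zone rather than `spark.sql.session.timeZone`. A minimal sketch of that underlying JVM behavior, using only the standard library (the class name `TzDemo` is mine, not from the thread):

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class TzDemo {
    public static void main(String[] args) {
        // Timestamp.valueOf interprets the string in the JVM default time zone.
        // Pinning the default to UTC makes the epoch string map to epoch millis 0.
        TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
        Timestamp ts = Timestamp.valueOf("1970-01-01 00:00:00.000");
        System.out.println(ts.getTime()); // prints 0
    }
}
```

In local mode, calling `TimeZone.setDefault(TimeZone.getTimeZone("UTC"))` before building the `SparkSession` is the workaround noted above. On a cluster, the approach in the linked answer is to pin the zone on every JVM instead, e.g. by passing `-Duser.timezone=UTC` through `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`.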
