
Spark SQL has support for automatically inferring the schema from a JSON input source (where each row is a standalone JSON document). It does so by scanning the entire data set to create the schema, but it's still useful. (I'm talking about 1.2.1, not the new 1.3, so there might be some changes.)

I've seen some conflicting posts about it being supported / not supported, but I think it was recently added (in 1.2).

My question is: what is the right way to format a Date/Datetime/Timestamp in JSON for Spark SQL to identify it as such in its auto schema inference mechanism?
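To make the question concrete, here's a minimal sketch of the behavior in question (assuming the Spark 1.2/1.3 API and a hypothetical people.json file whose birthdate values are ISO 8601 strings):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc is an existing SparkContext

// people.json, one JSON document per line, e.g.:
//   {"name":"Andy", "birthdate":"2012-04-23T18:25:43.511Z"}
val people = sqlContext.jsonFile("people.json")

// Inference scans the whole data set, but birthdate comes back as a plain
// string rather than a timestamp/date:
people.printSchema()
// root
//  |-- birthdate: string (nullable = true)
//  |-- name: string (nullable = true)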

Eran Medan
    From my experiments and from reading the implementation of `org.apache.spark.sql.json.JsonRDD` it looks to me like it won't automatically infer these types. When you read a JSON file or create JSON from an `RDD[String]` you don't _have_ to have it infer the schema -- you can provide one. I don't know if that's acceptable to you. I can give you the code but I hesitate to make it an "Answer" because you specifically asked about "auto schema inference" and there's nothing automatic about my solution other than the parsing. – Spiro Michaylov Mar 28 '15 at 19:40
    @SpiroMichaylov: I've turned your comment into an answer and fleshed it out a bit. I hope you don't mind. I think it would be useful if you added your code example for creating the JSON with a custom schema to either my answer or to an answer of your own. – Daniel Darabos Jul 20 '15 at 13:12

4 Answers


With a small modification it is possible to have dates inferred in a format of your choosing (I used the Date.toJSON format) while keeping reasonable performance.

Get the latest maintenance branch:

git clone https://github.com/apache/spark.git
cd spark
git checkout branch-1.4

Replace the following block in InferSchema:

  case VALUE_STRING if parser.getTextLength < 1 =>
    // Zero length strings and nulls have special handling to deal
    // with JSON generators that do not distinguish between the two.
    // To accurately infer types for empty strings that are really
    // meant to represent nulls we assume that the two are isomorphic
    // but will defer treating null fields as strings until all the
    // record fields' types have been combined.
    NullType

  case VALUE_STRING => StringType

with the following code:

  case VALUE_STRING =>
    val len = parser.getTextLength
    if (len < 1) {
      NullType
    } else if (len == 24) {
      // try to match dates of the form "1968-01-01T12:34:56.789Z"
      // for performance, only try parsing if text is 24 chars long and ends with a Z
      val chars = parser.getTextCharacters
      val offset = parser.getTextOffset
      if (chars(offset + len - 1) == 'Z') {
        try {
          org.apache.spark.sql.catalyst.util.
            DateUtils.stringToTime(new String(chars, offset, len))
          TimestampType
        } catch {
          case e: Exception => StringType
        }
      } else {
        StringType
      }
    } else {
      StringType
    }

Build Spark according to your setup. I used:

mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests=true clean install

To test, create a file named datedPeople.json at the top level which contains the following data:

{"name":"Andy", "birthdate": "2012-04-23T18:25:43.511Z"}
{"name":"Bob"}
{"name":"This has 24 characters!!", "birthdate": "1988-11-24T11:21:13.121Z"}
{"name":"Dolla Dolla BillZZZZZZZZ", "birthdate": "1968-01-01T12:34:56.789Z"}

Read in the file. Make sure that you set the conf option before using sqlContext at all, or it won't work. Dates!

.\bin\spark-shell.cmd
scala> sqlContext.setConf("spark.sql.json.useJacksonStreamingAPI", "true")
scala> val datedPeople = sqlContext.read.json("datedPeople.json")

datedPeople: org.apache.spark.sql.DataFrame = [birthdate: timestamp, name: string]

scala> datedPeople.foreach(println)

[2012-04-23 13:25:43.511,Andy]
[1968-01-01 06:34:56.789,Dolla Dolla BillZZZZZZZZ]
[null,Bob]
[1988-11-24 05:21:13.121,This has 24 characters!!]
heenenee
  • This is the closest it gets to what I wanted, and I like the way you reduce the number of exceptions thrown. Do you mind if I take this approach and put it in as an add-on format (instead of modifying the base JSON reader)? – Eran Medan Jul 21 '15 at 23:09

The JSON type inference will never infer date types. Non-zero-length strings are always inferred to be strings. Source code:

private[sql] object InferSchema {
  // ...
  private def inferField(parser: JsonParser): DataType = {
    import com.fasterxml.jackson.core.JsonToken._
    parser.getCurrentToken match {
      // ...
      case VALUE_STRING => StringType
      // ...
    }
  }
  // ...
}

For automatic detection, this would have to be changed to look at the actual string (parser.getValueAsString) and return DateType when the value matches the appropriate format.

It's probably simpler to just take the normal auto-generated schema and convert the date types as a second step.
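For example, something along these lines (a sketch, assuming the Spark 1.4 read API and that the strings are in a form the built-in string-to-timestamp cast accepts):

import org.apache.spark.sql.types.TimestampType

// Let inference produce a string column, then convert it in a second step.
val raw = sqlContext.read.json("datedPeople.json")   // birthdate inferred as string
val withDates = raw.select(
  raw("name"),
  raw("birthdate").cast(TimestampType).as("birthdate"))

withDates.printSchema()   // name: string, birthdate: timestamp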

Another option would be to read a small sample of the data (without using Spark) and infer the schema yourself. Then use your schema to create the DataFrame. This avoids some computation as well.
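A minimal sketch of that explicit-schema route (the one Spiro describes in the comments on the question), assuming the Spark 1.4 read API (sqlContext.jsonFile(path, schema) is the 1.3 equivalent):

import org.apache.spark.sql.types._

// Schema decided up front, e.g. by inspecting a handful of records by hand,
// so no full scan is needed just to figure out the types.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("birthdate", TimestampType, nullable = true)
))

val datedPeople = sqlContext.read.schema(schema).json("datedPeople.json")
datedPeople.printSchema()   // name: string, birthdate: timestamp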

Daniel Darabos
  • Yes, I saw that after I posted the bounty. I thought about adding a new rule there, e.g. Date(...), but I would need some kind of "replace token" or "skip chars" function, and the text is already tokenized; that means I would need to extend not just Spark but also the Jackson JSON parser. Apparently JSON is just not fit for this job if you need to treat dates as primitive data types... the bounty is yours (unless, however unlikely, another answer turns out to be a better fit). Thank you! – Eran Medan Jul 20 '15 at 17:51
  • How are your dates stored? I assumed each date was stored as a string. If it is, the string will be parsed as a single token. So you could still go token by token. I mean, that's how I imagine it is — I haven't tried modifying the schema inference code personally :). You could also consider filing a feature request in the [Spark JIRA](https://issues.apache.org/jira/browse/SPARK), perhaps something good comes of it down the line. – Daniel Darabos Jul 20 '15 at 20:53
  • Yep, as strings, but I think it would be costly to try to parse each and every one as a date and determine whether it's a date or a string based on that... :) – Eran Medan Jul 21 '15 at 02:08

Since Spark 3.0, the JSON data source infers TimestampType from string values if they match the pattern defined by the JSON option timestampFormat.

The inferTimestamp option can be set to false to disable this type inference.
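For example (a sketch against the Spark 3.0 API; the pattern and file name are placeholders to adapt to your data):

// Opt in to timestamp inference and tell the reader what the strings look like.
val df = spark.read
  .option("inferTimestamp", "true")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
  .json("datedPeople.json")

df.printSchema()   // birthdate is now inferred as timestamp instead of string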

Pandawan

To provide an update for 2021: at this time, strings can be inferred as TimestampType, but not DateType. See: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#json-scala.collection.Seq-

You can set an option in spark.read.format("json") or in from_json() to specify the format that gets parsed as a timestamp. That option is timestampFormat and is described here: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#json-scala.collection.Seq-
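For instance (a sketch; the column names and the format pattern are placeholders to adapt to your data):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val fmt = "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("birthdate", TimestampType)
))

// Via the reader, with an explicit schema:
val fromFile = spark.read
  .format("json")
  .option("timestampFormat", fmt)
  .schema(schema)
  .load("datedPeople.json")

// Via from_json(), on a DataFrame with the raw JSON in a string column "raw":
import spark.implicits._
val raw = Seq("""{"name":"Andy","birthdate":"2012-04-23T18:25:43.511Z"}""").toDF("raw")
val parsed = raw.select(
  from_json(col("raw"), schema, Map("timestampFormat" -> fmt)).as("data"))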

Zafar