I either don't know what I'm looking for or the documentation is lacking. The latter seems to be the case, given this:

http://spark.apache.org/docs/2.2.2/api/java/org/apache/spark/sql/functions.html#to_json-org.apache.spark.sql.Column-java.util.Map-

"options - options to control how the struct column is converted into a json string. accepts the same options and the json data source."

Great! So, what are my options?
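To be clear about what I'm calling: the two-argument overload takes a java.util.Map<String, String> of options, so I assume the usage looks roughly like this (timestampFormat is one JSON data source option I know of; the complete list is what I can't find):

import static org.apache.spark.sql.functions.*;
import java.util.HashMap;
import java.util.Map;

// Illustrative names only (jsonOptions, withOptions); timestampFormat is one
// documented option for the JSON data source, but the full set is what I'm after.
Map<String, String> jsonOptions = new HashMap<>();
jsonOptions.put("timestampFormat", "yyyy-MM-dd HH:mm:ss");

Dataset<Row> withOptions = reader
    .withColumn("data", to_json(struct("record_count"), jsonOptions));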

I'm doing something like this:

// lit, to_json, and struct are static imports from org.apache.spark.sql.functions
Dataset<Row> formattedReader = reader
    .withColumn("id", lit(id))
    .withColumn("timestamp", lit(timestamp))
    .withColumn("data", to_json(struct("record_count")));

...and I get this result:

{
  "id": "ABC123",
  "timestamp": "2018-11-16 20:40:26.108",
  "data": "{\"record_count\": 989}"
}

I'd like this instead (no backslashes or wrapping quotes around the "data" value):

{
  "id": "ABC123",
  "timestamp": "2018-11-16 20:40:26.108",
  "data": {"record_count": 989}
}

Is this one of the options, by chance? Is there a better guide out there for Spark? The most frustrating part about Spark hasn't been getting it to do what I want; it's been the lack of good information on what it can do.

Tsar Bomba
  • You should [parse](https://stackoverflow.com/q/34069282/10465355) the JSON string first, and only after that apply `to_json`. – 10465355 Nov 20 '18 at 00:38
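(A rough sketch of what that comment suggests — applicable when a column already holds a JSON string; the schema here is hypothetical and would have to match the string's actual contents:)

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical schema; adjust it to whatever the JSON string really contains.
StructType dataSchema = new StructType()
    .add("record_count", DataTypes.LongType);

Dataset<Row> parsed = reader
    .withColumn("data", from_json(col("data"), dataSchema));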

1 Answer

You are JSON-encoding the record_count field twice: to_json turns the struct into a string, and that string gets escaped again when the whole row is serialized to JSON. Remove to_json; struct alone should be sufficient.

That is, change your code to something like this:

// "data" stays a struct column here, so the JSON writer emits it as a nested object
Dataset<Row> formattedReader = reader
    .withColumn("id", lit(id))
    .withColumn("timestamp", lit(timestamp))
    .withColumn("data", struct("record_count"));
Biswanath