I either don't know what I'm looking for or the documentation is lacking. The latter seems to be the case, given this:

http://spark.apache.org/docs/2.2.2/api/java/org/apache/spark/sql/functions.html#to_json-org.apache.spark.sql.Column-java.util.Map-

"options - options to control how the struct column is converted into a json string. accepts the same options and the json data source."

Great! So, what are my options?
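To be clear about what I'm calling: the two-argument overload takes a java.util.Map<String, String> of options, so I assume the usage looks roughly like this (timestampFormat is one JSON data source option I know of; the complete list is what I can't find):

import static org.apache.spark.sql.functions.*;
import java.util.HashMap;
import java.util.Map;

// Illustrative names only (jsonOptions, withOptions); timestampFormat is one
// documented option for the JSON data source, but the full set is what I'm after.
Map<String, String> jsonOptions = new HashMap<>();
jsonOptions.put("timestampFormat", "yyyy-MM-dd HH:mm:ss");

Dataset<Row> withOptions = reader
    .withColumn("data", to_json(struct("record_count"), jsonOptions));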

I'm doing something like this:

// lit, to_json, and struct are static imports from org.apache.spark.sql.functions
Dataset<Row> formattedReader = reader
    .withColumn("id", lit(id))
    .withColumn("timestamp", lit(timestamp))
    .withColumn("data", to_json(struct("record_count")));

...and I get this result:

{
  "id": "ABC123",
  "timestamp": "2018-11-16 20:40:26.108",
  "data": "{\"record_count\": 989}"
}

I'd like this instead (no backslashes or wrapping quotes around the "data" value):

{
  "id": "ABC123",
  "timestamp": "2018-11-16 20:40:26.108",
  "data": {"record_count": 989}
}

Is this one of the options, by chance? Is there a better guide out there for Spark? The most frustrating part about Spark hasn't been getting it to do what I want; it's been the lack of good information on what it can do.

Tsar Bomba
  • You should [parse](https://stackoverflow.com/q/34069282/10465355) the JSON string first, and only after that apply `to_json`. – 10465355 Nov 20 '18 at 00:38
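(A rough sketch of what that comment suggests — applicable when a column already holds a JSON string; the schema here is hypothetical and would have to match the string's actual contents:)

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical schema; adjust it to whatever the JSON string really contains.
StructType dataSchema = new StructType()
    .add("record_count", DataTypes.LongType);

Dataset<Row> parsed = reader
    .withColumn("data", from_json(col("data"), dataSchema));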

1 Answer

You are JSON-encoding the record_count field twice: to_json turns the struct into a string, and that string gets escaped again when the whole row is serialized to JSON. Remove to_json; struct alone should be sufficient.

That is, change your code to something like this:

// "data" stays a struct column here, so the JSON writer emits it as a nested object
Dataset<Row> formattedReader = reader
    .withColumn("id", lit(id))
    .withColumn("timestamp", lit(timestamp))
    .withColumn("data", struct("record_count"));
Biswanath