Add a new line in front of each line before writing to JSON format using Spark in Scala

Question

I'd like to add a new line in front of each JSON document before Spark writes it to my S3 bucket:

df.createOrReplaceTempView("ParquetTable")
val parkSQL = spark.sql("select LAST_MODIFIED_BY, LAST_MODIFIED_DATE, NVL(CLASS_NAME, className) as CLASS_NAME, DECISION, TASK_TYPE_ID from ParquetTable")
parkSQL.show(false)
parkSQL.count()

parkSQL.write.json("s3://test-bucket/json-output-7/")

With only this command, Spark produces files with the contents below:

{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}

What I'd like to achieve is something like this:

{"index":{}}
{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"index":{}}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}

Any insight on how to achieve this result would be greatly appreciated!

The goal is to be able to call the `bulk` load API against Elasticsearch, e.g. https://stackoverflow.com/questions/45601344/elasticsearch-bulk-json-data – Fisher Coder, May 22 '21 at 17:59

Srinivas · Accepted Answer · 2021-05-22T18:27:29.397

3

The code below prepends {"index":{}} to each row of the DataFrame: it serialises every row to a JSON string with to_json, joins the index line and the JSON string with a newline, and saves the result using the text format.

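import org.apache.spark.sql.functions.{concat_ws, lit, struct, to_json}
import spark.implicits._ // enables the $"column" syntax

df
  .select(
    lit("""{"index":{}}""").as("index"),  // constant action line for every row
    to_json(struct($"*")).as("json_data") // the whole row as one JSON string
  )
  .select(
    concat_ws(
      "\n", // puts the index column and the JSON data on two separate lines
      $"index",
      $"json_data"
    ).as("data")
  )
  .write
  .format("text") // required: the json writer would escape the string instead of writing it as-is
  .save("s3://test-bucket/json-output-7/")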

Final Output

cat part-00000-24619b28-6501-4763-b3de-1a2f72a5a4ec-c000.txt

{"index":{}}
{"CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"index":{}}
{"CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
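
As a variation on the same idea, here is a minimal sketch that uses `Dataset.toJSON` and `flatMap` instead of `concat_ws`, so the index line and the document become two separate records rather than one string with an embedded newline. The names `BulkLines`, `withAction`, and `toBulk`, the two-column sample data, the `local[1]` master, and the temporary output directory are all assumptions made for illustration; they are not part of the original question or answer.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object BulkLines {
  // Action line placed in front of every document, as requested in the question.
  val IndexAction: String = """{"index":{}}"""

  // Pure helper: one JSON document becomes two output lines.
  def withAction(json: String): Seq[String] = Seq(IndexAction, json)

  // Serialise each row to JSON, then emit the action line followed by the document.
  def toBulk(df: DataFrame): Dataset[String] = {
    import df.sparkSession.implicits._ // provides the Encoder[String] that flatMap needs
    df.toJSON.flatMap(json => withAction(json))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; on a cluster the session already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("bulk-lines-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for the real query result.
    val df = Seq(("david", "AGREE"), ("sarah", "DISAGREE"))
      .toDF("LAST_MODIFIED_BY", "DECISION")

    // Hypothetical output location; the question writes to an s3:// path instead.
    val out = Files.createTempDirectory("bulk").resolve("out").toString

    // A Dataset[String] has a single string column, which is what the text writer expects.
    toBulk(df).write.text(out)

    // Read the result back to inspect the interleaved lines.
    spark.read.textFile(out).collect().foreach(println)
    spark.stop()
  }
}
```

As with the answer's code, this goes through the text writer, so the part files still get a `.txt` extension.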

edited May 22 '21 at 18:27

answered May 22 '21 at 18:20


Thanks, I need them to be in `.json` format though. – Fisher Coder May 22 '21 at 18:43
Oh, I don't think it's possible to use the `json` format here, because with the `json` format the final output data will be `stringified`. – Srinivas May 22 '21 at 18:54
nvm, you are right, and using your command, I was able to bulk load this `.txt` file into my Elasticsearch, thanks a ton again! – Fisher Coder May 22 '21 at 18:58
