I'm processing a JSON file to generate two JSON files using Spark (version 1.6.1). The input file is about 30~40 GB (100M records). Of the generated files, the larger one is about 10~15 GB (30M records) and the smaller one is about 500~750 MB (1.5M records). Both result files have the problem described below:
I called the "sort" method on the DataFrame and then "repartition" to merge the results into a single file. When I checked the generated file, I found that the records are ordered within intervals, but the file as a whole is not globally ordered. For example, the key (built from 3 columns) of the last record in the file (line 1.9M) is "(ou7QDj48c, 014, 075)", but the key of a record in the middle of the file (line 375K) is "(pzwzh5vm8, 003, 023)":
pzwzh5vm8 003 023
...
ou7QDj48c 014 075
When I tested the code locally with a relatively small input (a 400K-line file), this did not happen at all.
My concrete code is shown below:
big_json = big_json.sort($"col1", $"col2", $"col3", $"col4")
big_json.repartition(1).write.mode("overwrite").json("filepath")
Could anyone give me some advice? Thank you.
(I've also noticed that this thread discusses a similar problem, but it has no good solution so far. If this behavior really is caused by the repartition operation, could anyone help me efficiently write a DataFrame to a single JSON file, without converting it to an RDD, while keeping the sorted order? Thanks)
Solution:
I really appreciate the help from @manos, @eliasah and @pkrishna. After reading your comments I considered using coalesce, but after investigating its performance I gave up on the idea.
The final solution is: sort the DataFrame and write it to JSON without any repartition or coalesce. After the whole job is done, call the HDFS command below:
hdfs dfs -getmerge /hdfs/file/path/part* ./local.json
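For reference, the Spark side of this solution (the step that runs before getmerge) might look roughly like the sketch below for Spark 1.6. The object name and the two path arguments are hypothetical; only the column names and the sort/write calls come from the question. It relies on sort() range-partitioning the data, so that the part files taken in name order should already be globally ordered.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

// Hypothetical driver object; not the original job.
object SortedJsonExport {
  def main(args: Array[String]): Unit = {
    // Hypothetical arguments: input JSON path and HDFS output directory.
    val Array(inputPath, outputDir) = args

    val sc = new SparkContext(new SparkConf().setAppName("SortedJsonExport"))
    val sqlContext = new SQLContext(sc)

    val bigJson = sqlContext.read.json(inputPath)

    // Global sort on the four columns from the question.
    // The result is range-partitioned: each partition covers a
    // contiguous key range and is sorted internally.
    val sorted = bigJson.sort("col1", "col2", "col3", "col4")

    // No repartition(1) or coalesce(1) here: Spark writes one part file
    // per partition, and the files are merged afterwards with getmerge.
    sorted.write.mode(SaveMode.Overwrite).json(outputDir)

    sc.stop()
  }
}
```

The output directory then contains many part files, which is what the `part*` glob in the getmerge command picks up.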
The getmerge command worked far better than I expected. It takes neither much time nor much space, and it gives me a good single file. I ran head and tail on the huge result file, and it appears to be totally ordered.
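Since head and tail only look at the two ends of the file, a fuller check is to stream through the keys and look for the first record that is smaller than its predecessor. Below is a minimal sketch in plain Scala, assuming the three-column string key from the question and ordinary String comparison (which may differ from Spark's ordering for non-ASCII data). The names OrderCheck and firstOutOfOrder are made up for illustration.

```scala
object OrderCheck {
  // The 3-column key from the question, e.g. ("ou7QDj48c", "014", "075").
  type Key = (String, String, String)

  // Returns the 0-based index of the first record whose key is smaller
  // than the key of the record before it, or None if the sequence is
  // in non-decreasing order. Works on an Iterator, so the whole file
  // never has to be held in memory.
  def firstOutOfOrder(keys: Iterator[Key]): Option[Long] = {
    val ord = implicitly[Ordering[Key]] // lexicographic tuple ordering
    if (!keys.hasNext) return None
    var prev = keys.next()
    var idx = 1L
    while (keys.hasNext) {
      val cur = keys.next()
      if (ord.lt(cur, prev)) return Some(idx)
      prev = cur
      idx += 1
    }
    None
  }
}
```

To use it on the merged file, the lines could be read with scala.io.Source.fromFile(...).getLines() and mapped through a JSON-to-key extractor of your choice (not shown here, since it depends on the JSON library available).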