I'm using Spark SQL to run a query over my dataset. The result of the query is fairly small, but it is still split across several partitions.
I would like to coalesce the resulting DataFrame into a single partition and order the rows by a column. I tried:
DataFrame result = sparkSQLContext.sql("my sql").coalesce(1).orderBy("col1");
result.toJSON().saveAsTextFile("output");
I also tried:
DataFrame result = sparkSQLContext.sql("my sql").repartition(1).orderBy("col1");
result.toJSON().saveAsTextFile("output");
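For context, here is a minimal, self-contained version of roughly what I'm running (the table name, column names, input registration, and paths are placeholders standing in for my real data source):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CoalesceOrderBy {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("coalesce-orderby");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sparkSQLContext = new SQLContext(sc);

        // placeholder: register the source table however the real data is actually loaded
        sparkSQLContext.read().json("input.json").registerTempTable("my_table");

        // attempt 1: coalesce then order (attempt 2 only swaps coalesce(1) for repartition(1))
        DataFrame result = sparkSQLContext.sql("SELECT col1, col2 FROM my_table")
                .coalesce(1)
                .orderBy("col1");

        // write the result out as JSON text
        result.toJSON().saveAsTextFile("output");
        sc.stop();
    }
}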
In both cases the output file is ordered in chunks (i.e. each partition is ordered internally, but the data frame is not ordered as a whole). For example, instead of
1, value
2, value
4, value
4, value
5, value
5, value
...
I get
2, value
4, value
5, value
-----------> partition boundary
1, value
4, value
5, value
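In case it's useful for diagnosing this, a check along these lines (again with placeholder table and column names) should report the partition count after each step:

DataFrame plain = sparkSQLContext.sql("SELECT col1, col2 FROM my_table");

// number of partitions of the raw query result
System.out.println(plain.rdd().partitions().length);

// after coalesce(1) -- expected to be 1
System.out.println(plain.coalesce(1).rdd().partitions().length);

// after coalesce(1).orderBy("col1") -- the count I would expect to stay at 1
System.out.println(plain.coalesce(1).orderBy("col1").rdd().partitions().length);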
- What is the correct way to get an absolute ordering of my query result?
- Why isn't the data frame being coalesced into a single partition?