
Apache Spark newbie here. I am querying a very large data set from ADLS using Apache Spark for .NET. After querying my data, I want to convert the DataFrame to a CSV file and send it to an API that consumes the CSV file. I have the following:

queryResult_df
  .Coalesce(1)
  .Write()
  .Option("header", "true")
  .Csv(<local_output_location>);

This takes hours to complete and is not optimal for what I'm trying to do. Is there a way to generate the CSV more efficiently and send it off to the consuming API rather than writing it out locally?
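If the result is small enough to fit on the driver (after whatever aggregation the query does), one option is to skip the local file entirely: collect the rows, build the CSV text in memory, and POST it to the API. The sketch below is a minimal, hedged example; the rows would come from something like `queryResult_df.Collect()` in Spark .NET, but here they are plain string arrays so the helper itself has no Spark dependency, and `apiUrl` is a hypothetical endpoint you would replace with the real consuming API's URL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class CsvShipper
{
    // Quote a field only when it contains a delimiter, quote, or newline (RFC 4180 style).
    private static string Escape(string field)
    {
        if (field.Contains(',') || field.Contains('"') || field.Contains('\n'))
            return "\"" + field.Replace("\"", "\"\"") + "\"";
        return field;
    }

    // Build CSV text from a header and already-collected rows.
    public static string ToCsv(string[] header, IEnumerable<string[]> rows)
    {
        var sb = new StringBuilder();
        sb.Append(string.Join(",", header.Select(Escape))).Append('\n');
        foreach (var row in rows)
            sb.Append(string.Join(",", row.Select(Escape))).Append('\n');
        return sb.ToString();
    }

    // Hypothetical: POST the CSV body to the consuming API as text/csv.
    public static async Task PostCsvAsync(string csv, string apiUrl)
    {
        using var client = new HttpClient();
        using var content = new StringContent(csv, Encoding.UTF8, "text/csv");
        var response = await client.PostAsync(apiUrl, content);
        response.EnsureSuccessStatusCode();
    }
}
```

This only works when the result set fits in driver memory; if it doesn't, you are back to writing partitioned output and streaming the files to the API instead.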

Bonaii
  • I feel that output isn't the slow part... it might be that your query is slow. Note that the query isn't actually evaluated until you call `.Write()` - that's Spark's lazy evaluation. – mck Dec 18 '20 at 19:03
  • Double on that – the writing happens after the query is done, plus you're doing the `.Coalesce(1)` – Alex Ott Dec 18 '20 at 19:07
  • @Alex correct me if I'm wrong, but it's my understanding that `.Coalesce(1)` will attempt to consolidate all partitions if there aren't many of them; otherwise it will keep the same number of partitions. @mck so I tried `.Show()` on my queryResult dataframe and it executed fairly fast. Is the show method handled differently? – Bonaii Dec 18 '20 at 20:32
  • Show will take only a small piece of the results. If you're on Spark 3, try writing to the noop format (better) or do a count on the dataframe – Alex Ott Dec 18 '20 at 20:34
  • Coalesce will still consolidate everything to one executor – Alex Ott Dec 18 '20 at 20:34
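The commenters' diagnostic suggestion can be sketched as follows (a minimal sketch in Spark .NET, assuming Spark 3+ where the `noop` sink is available): force full evaluation of the DataFrame without producing any output, so you can see how much of the hours is the query itself versus the single-partition write.

```csharp
// Fully evaluates the DataFrame but discards the output (Spark 3+),
// isolating query time from write time:
queryResult_df.Write().Format("noop").Mode("overwrite").Save();

// Alternative: Count() also forces a full evaluation of the query.
long rowCount = queryResult_df.Count();
```

If these also take hours, the bottleneck is the query (or the `.Coalesce(1)` funneling everything through one executor), not the CSV output step.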

0 Answers