
I'm trying to export a DataFrame to a CSV file using .NET for Apache Spark, but the exported file gets the default name 'part-00000-{GUID}'. I want to control the file name according to my business rules, e.g. 'ABC_20200504.csv'.

This is my code:

string pathSource = Path.Combine(path, folderName);

exportDataFrame
    .Coalesce(1)
    .Write()
    .Option("header", "false")
    .Mode(SaveMode.Append)
    .Csv(pathSource);

I tried changing pathSource so that it ends in 'test.csv', but with this approach Spark always creates a directory with that name and writes the part file inside the 'test.csv' folder.

I really need a solution for this. If someone could help, I would be very thankful.

Michael Rys
  • Put your code into your question as text formatted as code (there is a special button for that in the editor). Don't use images of source code. Check the preview of your question before posting. Here, your image is not even shown within your question. – V. S. May 04 '20 at 15:38
  • @VadimS. I've just edited it, thanks for the comment. Can you have a look, please? – João Sousa May 05 '20 at 08:25
  • The text format is not the problem. I want to export a DataFrame to a CSV file (it's already doing that); my problem is setting the file name, because Spark always generates the names itself. – João Sousa May 05 '20 at 09:35

1 Answer


Try this code:

exportDataFrame
    .Repartition(1)
    .Write()
    .Mode("overwrite")
    .Format("com.databricks.spark.csv")
    .Option("header", "true")
    .Save("ABC_20200504.csv");

This should produce a single output file, \ABC_20200504.csv\part-00000, inside a folder named ABC_20200504.csv.

You can then rename the part-00000 file, as in this example:

System.IO.File.Move("D:\\part-00000.txt", "D:\\ABC_20200504.txt");  
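
Since the part file name actually carries a generated suffix (part-00000-{GUID}.csv, as noted in the comments), the rename step can look the file up by pattern instead of hard-coding it. The following is only a minimal sketch under stated assumptions: the output folder is on a local file system that the driver process can read, the write produced exactly one part file, and the target file lies outside the output folder. The names SparkCsvRenamer and MoveSinglePartFile are hypothetical helpers, not part of .NET for Apache Spark.

```csharp
using System;
using System.IO;

public static class SparkCsvRenamer
{
    // Hypothetical helper (not a Spark API): moves the single part-*.csv file
    // found in sparkOutputDir to targetFile, then deletes sparkOutputDir.
    // Assumes a local file system and a targetFile outside sparkOutputDir.
    public static string MoveSinglePartFile(string sparkOutputDir, string targetFile)
    {
        // Matches the data file only; _SUCCESS and the hidden .crc files
        // do not start with "part-" and are therefore skipped.
        string[] parts = Directory.GetFiles(sparkOutputDir, "part-*.csv");
        if (parts.Length != 1)
        {
            throw new InvalidOperationException(
                $"Expected exactly one part file in '{sparkOutputDir}', found {parts.Length}.");
        }

        // Remove any previous export with the same name (no error if absent).
        File.Delete(targetFile);
        File.Move(parts[0], targetFile);

        // Clean up the folder Spark created, including _SUCCESS and .crc files.
        Directory.Delete(sparkOutputDir, recursive: true);
        return targetFile;
    }
}
```

One way to use it is to write into a scratch folder with Coalesce(1) and an overwrite save mode, then call SparkCsvRenamer.MoveSinglePartFile(scratchFolder, Path.Combine(path, "ABC_20200504.csv")). With SaveMode.Append the folder can accumulate several part files across runs, in which case this sketch throws rather than guessing which file to rename.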

The original solution was written in Scala; I took it from the link below and adapted it for C#: https://www.dataneb.com/post/how-to-write-single-csv-file-using-spark
The link describes five methods for writing a single CSV file.

V. S.
  • Thanks for the comment, but that is for Python, right? For C# on .NET Core, I don't have that import. Do you know of any solution for .NET? Thanks. – João Sousa May 05 '20 at 13:33
  • Can you try the code below and let me know what result you get? The code in this comment was originally written in Scala and adapted for .NET. I don't have an environment to check it in .NET right now, so I can only assume it may help (the code is taken from https://stackoverflow.com/questions/31674530/write-single-csv-file-using-spark-csv ): exportDataFrame.Coalesce(1).Write().Format("com.databricks.spark.csv") .Option("header", "true") .Save("ABC_20200504.csv") – V. S. May 05 '20 at 14:17
  • Same result: it creates a folder "mydata.csv" which contains part-00000-{GUID}.csv files. I think the only solution will be to select all the generated files with that kind of name and rename them according to my business rules. This is the approach I'm developing right now. – João Sousa May 05 '20 at 14:48
  • Try the solution from the updated answer I've just posted. It differs from the previous approach. It is all I can propose right now. – V. S. May 05 '20 at 15:55
  • Spark (and Hive) do not like to operate at the file level but prefer to operate at the folder level and use files as "scale-out extents". I ran Vadim's code in .NET for Spark on Azure Synapse and it worked. You would then need to rename the file. – Michael Rys May 21 '20 at 22:55