
I am new to Apache Spark and am trying to write some rows into a Delta table (locally for now, eventually into ADLS Gen2) using the dotnet/spark package. I'm using the following approach, similar to this question, notably the .Format("delta") call:

using System.Threading.Tasks;

using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

public static async Task WriteToDelta()
{
    SparkSession spark = SparkSession.Builder().AppName("DeltaTableWrite").GetOrCreate();

    // Create a schema for the data
    StructType schema = new StructType(new[]
    {
        new StructField("id", new IntegerType()),
        new StructField("name", new StringType()),
        new StructField("age", new IntegerType())
    });

    // Create a DataFrame with sample data
    DataFrame df = spark.CreateDataFrame(new[]
    {
        new GenericRow(new object[] {1, "John Smith", 40}),
        new GenericRow(new object[] {2, "Jane Doe", 20}),
        new GenericRow(new object[] {3, "Bob Smith", 30})
    }, schema);

    // Write the DataFrame as a Delta table (locally for now)
    df.Write()
        .Format("delta")
        .Option("mergeSchema", "true")
        .Mode(SaveMode.Overwrite)
        .Save(@"C:\source\path\to\table");
}

However, when I run this I get the Failed to find data source: delta. Please find packages at http://spark.apache.org/third-party-projects.html error, which I understand to mean that I have to install the delta-sharing package. Its README, though, makes no mention of C#/.NET support, and I'm not sure how to install/add the package as part of the Apache Spark connector. Is this something I install for Java using Maven? Can someone explain how to achieve this?


1 Answer


You need to add the Delta Lake libraries as described in the documentation. You have two choices:

  • if you're using spark-submit to run the app, then you need to add the following to the command line (a full example invocation is sketched after this list):
--packages io.delta:delta-core_2.12:2.3.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
  • specify the same options when building the session in code:
SparkSession spark = SparkSession.Builder().AppName("DeltaTableWrite")
   .Config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
   .Config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
   .Config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0")
   .GetOrCreate();
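
For the first option, a complete spark-submit invocation for a .NET for Apache Spark app could look roughly like this. The DotnetRunner class is the standard entry point for dotnet/spark apps, but the master URL, microsoft-spark jar version, and app name below are placeholders, not values from the question:

spark-submit \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  --packages io.delta:delta-core_2.12:2.3.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  microsoft-spark-3-2_2.12-<version>.jar \
  dotnet MyDeltaApp.dll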

P.S. Make sure that the Delta Lake version matches your Spark version, as described in the doc.
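
Putting the second option together with the write from the question, a minimal end-to-end sketch might look like the following (the output path is the question's placeholder, and the delta-core version is the same one used above):

using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

public static void WriteToDelta()
{
    // Register the Delta extension/catalog and pull delta-core from Maven;
    // these options must be set before the session is created.
    SparkSession spark = SparkSession.Builder().AppName("DeltaTableWrite")
        .Config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .Config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .Config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0")
        .GetOrCreate();

    DataFrame df = spark.CreateDataFrame(new[]
    {
        new GenericRow(new object[] {1, "John Smith", 40})
    }, new StructType(new[]
    {
        new StructField("id", new IntegerType()),
        new StructField("name", new StringType()),
        new StructField("age", new IntegerType())
    }));

    // "delta" now resolves because delta-core is on the classpath
    df.Write().Format("delta").Mode(SaveMode.Overwrite).Save(@"C:\source\path\to\table");
}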
