I am new to Apache Spark and am trying to write some rows into a Delta table (locally for now, eventually into ADLS Gen2) using the dotnet/spark package. I'm using the following approach, similar to this question, notably the .Format("delta") call:
using System.Threading.Tasks;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

public static async Task WriteToDelta()
{
    SparkSession spark = SparkSession.Builder().AppName("DeltaTableWrite").GetOrCreate();

    // Create a schema for the data
    StructType schema = new StructType(new[]
    {
        new StructField("id", new IntegerType()),
        new StructField("name", new StringType()),
        new StructField("age", new IntegerType())
    });

    // Create a DataFrame with sample data
    DataFrame df = spark.CreateDataFrame(new[]
    {
        new GenericRow(new object[] {1, "John Smith", 40}),
        new GenericRow(new object[] {2, "Jane Doe", 20}),
        new GenericRow(new object[] {3, "Bob Smith", 30})
    }, schema);

    // Write the DataFrame as a Delta table (local path for now, blob storage later)
    df.Write()
        .Format("delta")
        .Option("mergeSchema", "true")
        .Mode(SaveMode.Overwrite)
        .Save(@"C:\source\path\to\table");
}
However, when I run this I get the error "Failed to find data source: delta. Please find packages at http://spark.apache.org/third-party-projects.html", which I understand to mean that I have to install the delta-sharing package. Their README makes no mention of C#/.NET support, though, and I'm not sure how to install or add the package when using the .NET for Apache Spark connector. Is this something I install for Java using Maven? Can someone explain how to achieve this?
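In case it helps, my best guess from the Delta Lake quickstart for Scala/Python is that the package gets pulled in at launch time through spark-submit rather than through NuGet, so the command would look roughly like the sketch below. The Delta and microsoft-spark version numbers are guesses on my part, and MyApp.dll is a placeholder for my published app:

    spark-submit \
        --packages io.delta:delta-core_2.12:2.1.0 \
        --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
        --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
        --class org.apache.spark.deploy.dotnet.DotnetRunner \
        --master local \
        microsoft-spark-3-2_2.12-2.1.1.jar \
        dotnet MyApp.dll

I haven't been able to confirm whether this is the right way to make the delta data source visible to a .NET app, or whether the two --conf extension settings are even needed for a plain DataFrame write, so any pointers would be appreciated.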