I am new to .NET for Spark and need to append a C# list to a Delta table. I assume I first need to create a Spark DataFrame to do this. In the sample code below, how would I go about appending "names" to the DataFrame "df"?

It seems that now that Mobius (https://github.com/Microsoft/Mobius) has been deprecated, RDDs are not available in the new version (https://github.com/dotnet/spark).

using System.Collections.Generic;
using Microsoft.Spark.Sql;

namespace HelloSpark
{
    class Program
    {
        static void Main(string[] args)
        {
            var spark = SparkSession.Builder().GetOrCreate();
            var df = spark.Read().Json("people.json");
            df.Show();

            var names = new List<string> { "john", "20" };

        }
    }
}

The example file people.json looks like the following:

{"name":"Michael"}
{"name":"Andy", "age":"30"}
{"name":"Justin", "age":"19"}
AeroX
ow123

2 Answers

You can now create a dataframe in .NET for Apache Spark (you couldn't when this question was written).

To do it, pass in a list of GenericRow objects, each of which takes an array of objects holding one value per column. You also need to define the schema:


using System;
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

namespace CreateDataFrame
{
    class Program
    {
        static void Main(string[] args)
        {
            var spark = SparkSession.Builder().GetOrCreate();
            
            var df = spark.Read().Json("people.json");
            df.Show();

            var names = new List<string> { "john", "20" };

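            // One GenericRow per row; the values line up with the schema fields below.
            var newNamesDataFrame = spark.CreateDataFrame(
                new List<GenericRow> { new GenericRow(names.ToArray()) },
                new StructType(
                    new List<StructField>()
                    {
                        new StructField("name", new StringType()),
                        new StructField("age", new StringType())
                    }));

            // Union matches columns by position, and the JSON reader orders the
            // inferred columns alphabetically (age, name), so match by name instead.
            newNamesDataFrame.UnionByName(df).Show();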
        }
    }
}

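Now that you have the DataFrame, you can write it with the DataFrameWriter returned by Write(), for example newNamesDataFrame.Write().Format("delta").Mode("append").Save("/Path/ToFile"), where Mode("append") adds the rows to an existing table.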
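A fuller sketch of that write step, in case it helps. It assumes the Delta Lake package is available to the Spark job and that "/Path/ToFile" (the placeholder path from above) is where the table lives. The namespace name is made up for illustration. Without an explicit mode the writer fails if the path already exists, so the append mode is set explicitly:

```csharp
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

namespace AppendToDelta // hypothetical name, for illustration only
{
    class Program
    {
        static void Main(string[] args)
        {
            var spark = SparkSession.Builder().GetOrCreate();

            // One row to append; the values line up with the schema fields below.
            var names = new List<string> { "john", "20" };

            var schema = new StructType(new List<StructField>
            {
                new StructField("name", new StringType()),
                new StructField("age", new StringType())
            });

            var newRows = spark.CreateDataFrame(
                new List<GenericRow> { new GenericRow(names.ToArray()) },
                schema);

            // Assumption: the Delta Lake package is on the Spark classpath.
            // "/Path/ToFile" is a placeholder; point it at your own table.
            // Mode("append") adds the rows instead of failing on an existing path.
            newRows.Write()
                .Format("delta")
                .Mode("append")
                .Save("/Path/ToFile");
        }
    }
}
```

If the target table already has a schema, the column names and types of the new rows need to be compatible with it, so it is worth checking the schema before appending.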

Dharman
Ed Elliott

You need to create another DataFrame from the list and union it with the original DataFrame. Once done, you can write it to external storage. You can look for the corresponding C# APIs based on the pseudocode below:

 var names = new List<string> { "john", "20" };
 // Create a DataFrame from this list.
 // In Scala you can do this with spark.createDataFrame.
 // yourschemaclass is a placeholder for your schema definition.
 var newdf = spark.createDataFrame(names, yourschemaclass);
 // Union it with the original df.
 var joineddf = df.union(newdf);
 // Write to external storage if you want.
 joineddf.write();
Amit
  • There does not seem to be an equivalent API for createDataFrame from what I can see in the new version of Apache Spark for .NET – ow123 Aug 08 '19 at 11:11
  • Oh okay. Could you please share the link to documentation. I tried searching but ended up on github. – Amit Aug 08 '19 at 13:19
  • https://learn.microsoft.com/en-us/dotnet/spark/resources/ - This seems to be the only documentation I can find – ow123 Aug 08 '19 at 15:22
  • and this - https://learn.microsoft.com/en-us/dotnet/api/?view=spark-dotnet – ow123 Aug 08 '19 at 15:31
  • Thanks, I do not see APIs similar to the ones Scala has. One option is to put the data that you have in a JSON or CSV file and create a DataFrame from that. After that, the union step stays the same. – Amit Aug 08 '19 at 20:54
  • Please note that one of the recent releases of .NET for Apache Spark supports the CreateDataFrame() function now. – Michael Rys Aug 06 '20 at 01:15