
As per the title, I would like to submit a calculation to a Spark cluster (local or HDInsight in Azure) and get the results back in a C# application.

I am aware of Livy, which I understand is a REST API application sitting on top of Spark for querying it, but I have not found a standard C# client package for it. Is Livy the right tool for the job? Is it simply missing a well-known C# client?

The Spark cluster needs to access Azure Cosmos DB, so I need to be able to submit a job that includes the connector jar library (or its path on the cluster driver) in order for Spark to read data from Cosmos DB.
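For reference, Livy's batch endpoint does accept extra jars alongside the main application jar, which covers the connector-jar requirement above. A minimal, language-agnostic sketch of the submission payload (the URL, jar paths, class name, and arguments are all made-up placeholders):

```python
import json

# Livy batch submission is a POST to http://<cluster>:8998/batches
# (on HDInsight: https://<cluster>.azurehdinsight.net/livy/batches, with basic auth).
# All paths and names below are placeholders.
payload = {
    "file": "wasbs:///jobs/my-spark-job.jar",            # main application jar
    "className": "com.example.CosmosJob",                # entry-point class
    "jars": ["wasbs:///jars/azure-cosmosdb-spark.jar"],  # extra jars, e.g. the Cosmos DB connector
    "args": ["--collection", "myCollection"],            # arguments passed to the job
}

body = json.dumps(payload)
print(body)
```

Any HTTP client (HttpClient in C#, for instance) can send this body; Livy itself does not care what language the caller is written in.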

AeroX
Stefano d'Antonio

4 Answers


As a .NET Spark connector to query data did not seem to exist, I wrote one:

https://github.com/UnoSD/SparkSharp

It is just a quick implementation, but it also has a way of querying Cosmos DB using Spark SQL.

It's just a C# client for Livy, but it should be more than enough.

// 'config' here is the Livy session configuration; 'Result' below is a POCO matching the query's columns
using (var client = new HdInsightClient("clusterName", "admin", "password"))
using (var session = await client.CreateSessionAsync(config))
{
    var sum = await session.ExecuteStatementAsync<int>("val res = 1 + 1\nprintln(res)");

    const string sql = "SELECT id, SUM(json.total) AS total FROM cosmos GROUP BY id";

    var cosmos = await session.ExecuteCosmosDbSparkSqlQueryAsync<IEnumerable<Result>>
    (
        "cosmosName",
        "cosmosKey",
        "cosmosDatabase",
        "cosmosCollection",
        "cosmosPreferredRegions",
        sql
    );
}
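For the curious, a client like this wraps Livy's interactive-session REST API. A sketch of the request sequence, showing only the JSON bodies (host and credentials omitted; endpoints per the Livy REST API):

```python
import json

# Livy interactive-session flow that a C# client such as SparkSharp wraps:
#   1. POST   /sessions                        body: {"kind": "spark"}  -> returns a session id
#   2. POST   /sessions/{id}/statements        body: {"code": "..."}    -> returns a statement id
#   3. GET    /sessions/{id}/statements/{sid}  poll until state == "available", read the output
#   4. DELETE /sessions/{id}                   release the cluster resources
create_session = json.dumps({"kind": "spark"})
run_statement = json.dumps({"code": "val res = 1 + 1\nprintln(res)"})
print(create_session)
print(run_statement)
```

The `ExecuteStatementAsync<T>` call above corresponds to steps 2-3 plus deserializing the statement output into `T`.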
Stefano d'Antonio

If you're just looking for a way to query your Spark cluster using Spark SQL, then this is a way to do it from C#:

https://github.com/Azure-Samples/hdinsight-dotnet-odbc-spark-sql/blob/master/Program.cs

The console app requires an ODBC driver to be installed. You can find that here:

https://www.microsoft.com/en-us/download/details.aspx?id=49883

Also, the console app has a bug: you need to add a line to the code right after the point where the connection string is generated. Immediately after this line:

connectionString = GetDefaultConnectionString();

Add this line

connectionString = connectionString + "DSN=Sample Microsoft Spark DSN";

If you change the DSN name when installing the Spark ODBC driver, you will need to change the name in the line above accordingly.

Since you need to access data from Cosmos DB, you could open a Jupyter notebook on your cluster, ingest the data into Spark (creating a permanent table of your data there), and then use this console app/your C# app to query that data.

If you have a Spark job written in Scala/Python and need to submit it from a C# app, then I guess Livy is the best way to go. I am unsure whether Mobius supports that.

stt_code
  • The query will constantly change, so I don't think I can set it up in Jupyter and then query it with the ODBC driver, what you are suggesting (a permanent table) is more static, am I right? I presume it will also be held in memory and will need to be recreated on restart/data changes (?) if so, it won't be suitable. – Stefano d'Antonio Jul 01 '17 at 08:08
  • @stfano your table can always be appended with new data. if your table is cached you can just refresh cache. Creating first table: parquet_reader.write.saveAsTable("ptable") Appending to the table: new_parquet.write.saveAsTable("ptable", mode='append') Also your query can keep changing. You will need to modify the git hub code a little but you can have it taking dynamic queries. – stt_code Jul 06 '17 at 16:28

Microsoft just released dataframe-based .NET support for Apache Spark via the .NET Foundation OSS. See http://dot.net/spark and http://github.com/dotnet/spark for more details. It is now available in HDInsight by default if you select the correct HDP/Spark version (currently 3.6 and 2.3, soon others as well).

Michael Rys

UPDATE:

Long ago I said a clear no to this question. However, times have changed and Microsoft has made an effort. Please check out https://dotnet.microsoft.com/apps/data/spark

https://github.com/dotnet/spark

    // Create a Spark session
    var spark = SparkSession
        .Builder()
        .AppName("word_count_sample")
        .GetOrCreate();

Writing Spark applications in C# is now that easy!

OUTDATED:

No, C# is not the tool you should choose if you would like to work with Spark! However, if you really want to do the job with it, try Mobius, as mentioned above: https://github.com/Microsoft/Mobius

Spark has four main languages and APIs for them: Scala, Java, Python, and R. If you are looking for a language for production use, I would not suggest the R API; the other three work well.

For the Cosmos DB connection I would suggest: https://github.com/Azure/azure-cosmosdb-spark
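For reference, that connector is configured through a plain map of options passed to the Spark reader. A sketch of a typical read configuration (keys follow the connector's documentation; the endpoint, key, and names are placeholders):

```python
import json

# Typical read options for the azure-cosmosdb-spark connector; values are placeholders.
# In PySpark this map would be passed as:
#   df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**read_config).load()
read_config = {
    "Endpoint": "https://myaccount.documents.azure.com:443/",
    "Masterkey": "<primary-key>",
    "Database": "myDatabase",
    "Collection": "myCollection",
    "query_custom": "SELECT c.id, c.total FROM c",  # optional server-side filter
}
print(json.dumps(read_config, indent=2))
```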

András Nagy