
I am new to Apache Spark. I am trying to use the Microsoft Apache Spark NuGet library to read data from ADLS. I can't seem to figure out how I can authenticate using Spark. There seems to be no documentation around this at all. Is this even possible? I am writing a .NET Framework console app.

Any help/pointers would be greatly appreciated!

Bonaii
  • Are you running it on Databricks or locally on your desktop? If local, is it Windows or Mac/Linux? – Ed Elliott Nov 18 '20 at 07:38
  • @EdElliott It's a C# console application that will be deployed to an IIS server. – Bonaii Nov 18 '20 at 18:03
  • There is a great answer below for you - I have to question whether you are using the right architecture; Spark can run on a single local instance (or IIS server) but it isn't the typical use case. – Ed Elliott Nov 19 '20 at 16:04
  • @EdElliott I am trying to process a few months' worth of data stored on ADLS as Parquet files using a C# .NET Framework console app that gets run every few hours. The best way I have found to do this is with .NET for Apache Spark. Do you know of a better way to do this from a console app? – Bonaii Nov 19 '20 at 19:55

1 Answer


If you want to access Azure Data Lake Store in Spark, please refer to the following steps. Please note that I used Spark 3.0.1 with Hadoop 3.2 for testing.

  1. Create a Service Principal
az login
az ad sp create-for-rbac --name "myApp" --role contributor --scopes /subscriptions/<subscription-id>/resourceGroups/<group-name> --sdk-auth
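As a side note, the `--sdk-auth` flag prints the new credentials as a JSON blob; the sketch below (assuming a POSIX shell with `python3` available, and using a saved sample file in place of the real CLI output - the GUIDs are placeholders) shows which fields feed the Spark configs used in the code step:

```shell
# sp.json stands in for the JSON printed by `az ad sp create-for-rbac --sdk-auth`;
# the values here are placeholders, not real credentials.
cat > sp.json <<'EOF'
{
  "clientId": "42e0d080-b1f3-40cf-8db6-c4c522d988c4",
  "clientSecret": "dummy-secret",
  "tenantId": "00000000-0000-0000-0000-000000000000"
}
EOF

# Mapping onto the Spark configs:
#   clientId     -> fs.adl.oauth2.client.id
#   clientSecret -> fs.adl.oauth2.credential
#   tenantId     -> the <tenant> part of fs.adl.oauth2.refresh.url
CLIENT_ID=$(python3 -c "import json; print(json.load(open('sp.json'))['clientId'])")
TENANT_ID=$(python3 -c "import json; print(json.load(open('sp.json'))['tenantId'])")
echo "fs.adl.oauth2.client.id = $CLIENT_ID"
echo "refresh url = https://login.microsoftonline.com/$TENANT_ID/oauth2/token"
```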
  2. Grant the Service Principal access to the Data Lake
Connect-AzAccount
# get sp object id with sp's client id
$sp=Get-AzADServicePrincipal -ApplicationId  42e0d080-b1f3-40cf-8db6-c4c522d988c4

$fullAcl="user:$($sp.Id):rwx,default:user:$($sp.Id):rwx"
$newFullAcl = $fullAcl.Split("{,}")
Set-AdlStoreItemAclEntry -Account <> -Path / -Acl $newFullAcl -Recurse -Debug
  3. Code
// Required namespaces:
// using System.Collections.Generic;
// using Microsoft.Spark.Sql;
// using Microsoft.Spark.Sql.Types;

string filePath = "adl://<account name>.azuredatalakestore.net/parquet/people.parquet";

// Create SparkSession
SparkSession spark = SparkSession
    .Builder()
    .AppName("Azure Data Lake Storage example using .NET for Apache Spark")
    .Config("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
    .Config("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
    .Config("fs.adl.oauth2.client.id", "<sp appid>")
    .Config("fs.adl.oauth2.credential", "<sp password>")
    .Config("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<tenant>/oauth2/token")
    .GetOrCreate();

// Create sample data
var data = new List<GenericRow>
{
    new GenericRow(new object[] { 1, "John Doe" }),
    new GenericRow(new object[] { 2, "Jane Doe" }),
    new GenericRow(new object[] { 3, "Foo Bar" })
};

// Create schema for sample data
var schema = new StructType(new List<StructField>()
{
    new StructField("Id", new IntegerType()),
    new StructField("Name", new StringType()),
});

// Create DataFrame using data and schema
DataFrame df = spark.CreateDataFrame(data, schema);

// Print DataFrame
df.Show();

// Write DataFrame to Azure Data Lake Gen1
df.Write().Mode(SaveMode.Overwrite).Parquet(filePath);

// Read saved DataFrame from Azure Data Lake Gen1
DataFrame readDf = spark.Read().Parquet(filePath);

// Print DataFrame
readDf.Show();

// Stop Spark session
spark.Stop();
  4. Run
spark-submit ^
--packages org.apache.hadoop:hadoop-azure-datalake:3.2.0 ^
--class org.apache.spark.deploy.dotnet.DotnetRunner ^
--master local ^
microsoft-spark-3-0_2.12-<version>.jar ^
dotnet <application name>.dll
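When running a .NET app through spark-submit outside Databricks, Spark also needs to find the Microsoft.Spark.Worker (see the comments below the answer). A minimal sketch of the environment setup (POSIX shell shown here, although the spark-submit command above uses Windows `^` continuations; the install path is a placeholder for wherever you extracted the worker release):

```shell
# Point .NET for Apache Spark at the extracted Microsoft.Spark.Worker release
# (downloaded from https://github.com/dotnet/spark/releases); the path and
# version below are placeholders.
export DOTNET_WORKER_DIR="$HOME/bin/Microsoft.Spark.Worker-1.0.0"
echo "DOTNET_WORKER_DIR=$DOTNET_WORKER_DIR"
```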


For more details, please refer to

https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory

https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html

https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-access-control

Jim Xu
  • Where do I download the microsoft-spark jar file? I can't seem to locate that part. @jim – Bonaii Nov 19 '20 at 23:59
  • @Bonaii After you run the command `dotnet build`, you will find the jar file in your build output directory (such as E:\test\mySparkApp\bin\Debug\netcoreapp3.1). For more details, please refer to https://learn.microsoft.com/en-us/dotnet/spark/tutorials/get-started?tabs=windows#run-your-net-for-apache-spark-app and https://dotnet.microsoft.com/learn/data/spark-tutorial/run – Jim Xu Nov 20 '20 at 00:58
  • @Bonaii It is the same, but you need to download another Microsoft.Spark.Worker: https://github.com/dotnet/spark/releases – Jim Xu Nov 21 '20 at 11:30