11

I have a lot of data files that will eventually be pushed to and stored on Azure Storage/Data Lake at regular intervals. I want to provide the ability to run analytics on this data, but I see that Azure offers two approaches:

  1. U-SQL / Azure Data Lake query (Visualization ???)
  2. Spark SQL using Spark on Azure and Zeppelin

Can someone suggest when to use each of these approaches? It looks to me like both can do a similar job.

Dan Ciborowski - MSFT
Kiran

1 Answer

17

You can think of U-SQL as Microsoft's version of Spark SQL: you write SQL Server-style SQL and extend it with user-defined functions in C#. With Spark, you write a roughly MySQL-style SQL and extend it with either Scala or Python.
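To make the "extend it with Python" part concrete, here is a minimal sketch of registering a Python function as a Spark SQL UDF with PySpark. The helper `region_of`, the `files` view, and the sample paths are all hypothetical and only illustrate the pattern; the sketch assumes a working PySpark installation.

```python
def region_of(path):
    """Hypothetical helper: treat the first path segment as the region."""
    return path.strip("/").split("/")[0]


def main():
    # Imported here so the plain-Python helper above can be used
    # (and unit-tested) without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

    # Expose the Python function to SQL under the name "region_of".
    spark.udf.register("region_of", region_of, "string")

    # Made-up sample rows standing in for files landed in the lake.
    rows = [
        ("/emea/2016/01/a.csv", 120),
        ("/apac/2016/01/b.csv", 80),
        ("/emea/2016/02/c.csv", 50),
    ]
    spark.createDataFrame(rows, ["path", "size_bytes"]) \
        .createOrReplaceTempView("files")

    # Apply the UDF in a subquery, then aggregate on its result.
    result = spark.sql("""
        SELECT region, SUM(size_bytes) AS total_bytes
        FROM (SELECT region_of(path) AS region, size_bytes FROM files) t
        GROUP BY region
        ORDER BY region
    """)
    result.show()
    spark.stop()
```

Calling `main()` runs the query locally or on a cluster. In U-SQL the equivalent extension point would be a C# function instead of a Python one.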

If you are familiar with Scala or Python, then HDInsight might be the best choice. Spark comes with GraphX and MLlib, which at the moment have no analogues in Data Lake Analytics. Also, if you need something that works outside of Azure, then Spark SQL is your only option.

Another important dimension to think about is pricing. Data Lake Analytics only costs money while your query is executing, whereas HDInsight costs money for as long as the cluster is running. Depending on the size of the data and the complexity of your queries, Data Lake Analytics can be cheaper because you aren't charged while it's provisioning or sitting idle between queries.
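A rough way to compare the two billing models is to estimate a monthly cost for each and find the break-even number of jobs. The sketch below is only an illustration: the function names are mine, and every rate in it is a made-up placeholder, not an actual Azure price.

```python
def per_query_monthly_cost(jobs_per_month, hours_per_job,
                           units_per_job, rate_per_unit_hour):
    """Pay-per-job model: billed only while a query is executing."""
    return jobs_per_month * hours_per_job * units_per_job * rate_per_unit_hour


def cluster_monthly_cost(nodes, rate_per_node_hour, hours_up=730.0):
    """Always-on model: billed for every hour the cluster is running."""
    return nodes * rate_per_node_hour * hours_up


def break_even_jobs(cluster_cost, hours_per_job,
                    units_per_job, rate_per_unit_hour):
    """Jobs per month at which both models cost the same."""
    cost_per_job = hours_per_job * units_per_job * rate_per_unit_hour
    return cluster_cost / cost_per_job


# Placeholder rates (NOT real prices): 2.0 per unit-hour, 0.5 per node-hour.
on_demand = per_query_monthly_cost(30, 0.5, 10, 2.0)   # 300.0
always_on = cluster_monthly_cost(4, 0.5)               # 1460.0
threshold = break_even_jobs(always_on, 0.5, 10, 2.0)   # 146.0
print(on_demand, always_on, threshold)
```

With these placeholder numbers, the per-query model stays cheaper until roughly 146 such jobs per month. Substitute current prices and your own workload to get a meaningful figure.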

wm_eddie
  • 3
    Another aspect to consider besides @wm_eddie's is that today U-SQL is available for batch workloads only, while Spark SQL offers an interactive experience through notebooks. One caveat as of this comment: Spark in HDInsight does not yet work with ADLS (see http://stackoverflow.com/a/35569240/1318169). – Michael Rys Feb 24 '16 at 10:33
  • 3
    Spark/PySpark are now supported on HDInsight. After several months (~6) working with ADLA and a couple of months with HDInsight, it really comes down to: the skill set of the platform's users and platform support; the need for a persistent vs. on-demand cluster; and the type/size of data you need to process. I consistently find that analysts ramp up faster on U-SQL, since they already know ANSI SQL, but data engineers tend to gravitate to Spark. Also, U-SQL expects clean/structured data. HDInsight has better Power BI integration too. P.S. I would use Jupyter notebooks, but YARN configuration is critical. – jatal Feb 12 '18 at 19:59