
Pretty self-explanatory question: when should I use Azure ML Notebooks vs. Azure Databricks? I feel there’s a great overlap between the two products, and one is definitely better marketed than the other.

I’m mainly looking for information concerning dataset sizes and typical workflows. Why should I use Databricks over Azure ML if I don’t have a Spark-oriented workflow?

Thanks!

Anders Swanson
dernat71

1 Answer


@Nethim, from my point of view these are the main differences:

  1. Data Distribution:

    • Azure ML Notebooks are a good fit when you are training on limited data on a single machine. Azure ML does provide training clusters, but distributing the data among the nodes has to be handled in your code.
    • Azure Databricks, with its RDDs, is designed to handle data distributed across multiple nodes. This is advantageous when your data size is huge. When your data is small enough to fit on a scaled-up single machine, or you are working with a pandas DataFrame, Azure Databricks is overkill (see the PySpark sketch after this list).
  2. Data Cleaning: Databricks supports many file formats natively, and querying and cleaning huge datasets is straightforward. The same can be done in Azure ML notebooks, but the cleaning and the writing back to data stores have to be handled with custom code.

  3. Training: Both are capable of distributing the training. Databricks provides built-in ML algorithms (Spark MLlib) that act on the chunk of data held by each node and coordinate with the other nodes; distributed training with TensorFlow, Horovod, etc. is also possible on both Azure Machine Learning and Databricks (see the MLlib sketch after this list).
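To make points 1 and 2 concrete, here is a minimal PySpark sketch of the Databricks side: the read and the cleaning steps run distributed across the cluster with no manual sharding in your code. The path and column names (`/mnt/datalake/events.parquet`, `amount`, `timestamp`) are hypothetical placeholders, not anything from the question:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession already exists as `spark`; building one here
# just keeps the sketch self-contained.
spark = SparkSession.builder.appName("cleaning-demo").getOrCreate()

# Spark reads the file in parallel: each worker node holds a slice of the data.
df = spark.read.parquet("/mnt/datalake/events.parquet")  # hypothetical path

# The cleaning steps below execute on the nodes that hold each partition.
cleaned = (
    df.dropDuplicates()
      .filter(F.col("amount") > 0)                        # hypothetical column
      .withColumn("event_date", F.to_date(F.col("timestamp")))
)

cleaned.write.mode("overwrite").parquet("/mnt/datalake/events_clean.parquet")
```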
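And a minimal sketch of point 3, Databricks' built-in distributed training via Spark MLlib, continuing from the `cleaned` DataFrame above; the feature and label columns (`f1`, `f2`, `f3`, `label`) are again hypothetical:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# MLlib expects the features assembled into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train_df = assembler.transform(cleaned)  # `cleaned` from the sketch above

# fit() runs distributed: each executor works on its own partitions and the
# driver coordinates the optimisation.
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
model = lr.fit(train_df)
print(model.coefficients)
```

Note that you never write the node coordination yourself; that is what is meant above by the algorithms coordinating with other nodes.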

In general (just my opinion): if the dataset is small, AML notebooks are good. If the data size is huge, Azure Databricks makes the data cleanup and format conversions easy, and the training can then happen on either AML or Databricks. Note that Databricks has a learning curve, whereas Azure ML is easy to pick up if you already know Python and pandas.
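For contrast, the small-data path in an Azure ML notebook is plain single-machine Python. A minimal sketch with pandas and scikit-learn, assuming a hypothetical CSV small enough to fit in memory (file and column names are placeholders):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# The whole dataset is loaded into the memory of a single machine.
df = pd.read_csv("events_clean.csv")  # hypothetical file

# Ordinary in-memory split and training; no cluster involved.
X_train, X_test, y_train, y_test = train_test_split(
    df[["f1", "f2", "f3"]], df["label"], test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```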

Thanks.

SriramN
  • Hey SriramN, thanks a lot for your answer :-) It really helps! – dernat71 Apr 02 '20 at 10:59
  • This is a very informative and to-the-point answer! Thank you very much. – D_S_toowhite Aug 12 '22 at 07:53
  • Hi, thank you for the answer! What data size qualifies as small and what as big? – Péter Szilvási Mar 22 '23 at 10:01
  • @PéterSzilvási, from my experience (and I stand to be corrected), small data in terms of number of rows means a couple of million (2 to 5 million) rows, although on a single computer your memory is the real limit. Beyond that, the data qualifies as big, since it needs more processing power; that is where Spark comes in on Databricks with its parallel compute. – Julien Nyambal Aug 25 '23 at 08:50