Users like me find it confusing: what are the main differences between Azure Blob Storage and Azure Data Lake Storage? In which use cases does Blob Storage fit better than Data Lake Storage, and vice versa?

Thank you.

1 Answer

Blob storage is an object store with a flat structure. An object bundles a file with a name/identifier and some metadata. There is no concept of folders or hierarchy in blob storage. However, using a slash (/) in blob names gives the illusion of a hierarchy when you view containers in the Azure portal or Storage Explorer. These slash-delimited name prefixes can be thought of as virtual folders.
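To make the "virtual folder" idea concrete, here is a minimal sketch using a plain in-memory dictionary as a stand-in for a container. It is a toy model, not the Azure SDK: the function name `list_virtual_folder` and the sample blob names are hypothetical, made up for illustration.

```python
from typing import Dict, List, Set, Tuple


def list_virtual_folder(
    blobs: Dict[str, bytes], prefix: str, delimiter: str = "/"
) -> Tuple[List[str], List[str]]:
    """Return (virtual_folders, blob_names) found directly under `prefix`.

    The store itself is flat: "folders" are derived purely from blob names.
    """
    folders: Set[str] = set()
    files: List[str] = []
    for name in blobs:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter looks like a sub-folder.
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            files.append(name)
    return sorted(folders), sorted(files)


# Hypothetical container contents: just names mapped to bytes, no real folders.
container = {
    "logs/2023/01/app.log": b"a",
    "logs/2023/02/app.log": b"b",
    "logs/readme.txt": b"c",
    "data.csv": b"d",
}

print(list_virtual_folder(container, "logs/"))
```

Note that listing a "folder" here means scanning names for a matching prefix, which is why the hierarchy is only an illusion.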

(Ignoring ADLS Gen1, which is deprecated.)

Azure Data Lake Storage Gen2 (ADLS), on the other hand, is hierarchical storage. It has real folders: files are stored in folders, just like in the local file system on your workstation. It also supports Linux-like ACLs on files and folders. ADLS is Azure's HDFS-compatible offering.

The real benefit of ADLS is that moving and renaming files and folders is very efficient. This efficient directory manipulation helps analytics workloads such as Databricks/Spark, which work best on file systems.

Databricks can also work with blob storage, but these operations will not be performant and will involve a lot of unnecessary data copying. For example:

  • Moving or renaming a blob is a combination of a copy and a delete.
  • Renaming a folder on ADLS is simple, whereas on blob storage it involves copying every blob under the old prefix to a new name and then deleting the old blobs.
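The difference in rename cost can be sketched with two toy in-memory models. This is an illustration only and does not call any Azure API: the flat store is a dictionary keyed by full blob name, the hierarchical store is a nested dictionary, and both function names are hypothetical.

```python
from typing import Dict


def rename_folder_flat(blobs: Dict[str, bytes], old: str, new: str) -> int:
    """Rename a virtual folder in a flat store; return how many blobs were copied."""
    if old == new:
        return 0
    to_move = [name for name in blobs if name.startswith(old)]
    for name in to_move:
        blobs[new + name[len(old):]] = blobs[name]  # copy under the new name
        del blobs[name]                             # delete the old blob
    return len(to_move)


def rename_folder_hierarchical(tree: Dict[str, object], old: str, new: str) -> int:
    """Rename a directory in a nested tree; only one directory entry changes."""
    tree[new] = tree.pop(old)
    return 1


flat = {"raw/a.csv": b"1", "raw/b.csv": b"2", "other.txt": b"3"}
tree = {"raw": {"a.csv": b"1", "b.csv": b"2"}, "other.txt": b"3"}

flat_ops = rename_folder_flat(flat, "raw/", "staged/")
tree_ops = rename_folder_hierarchical(tree, "raw", "staged")
print(flat_ops, tree_ops)
```

In the flat model the work grows with the number of blobs under the prefix, while in the hierarchical model it stays constant regardless of how many files the folder contains.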

Use ADLS with tools like Spark and Databricks, and blob storage for everything else. Also note that ADLS costs 3x more and may be missing some features, such as blob versioning and point-in-time recovery.

ns15
  • When you configure Blob storage, you can activate a "hierarchical storage" option, which allows you to use folders as well. I would say that the biggest differences are their data models (more on Cosmos DB) and querying capabilities (only in Cosmos DB). – Echo9k Jun 29 '23 at 12:21