
To my understanding, a data-lake solution is used for storing everything from raw data in its original format to processed data. However, I have not been able to understand the concept of metadata management in the (Azure) data lake. What are best practices for dealing with metadata in a data lake?

Are there any mechanisms to read metadata automatically (e.g. from header files), and if so, are there ways to view and edit this metadata (perhaps an API to do it programmatically)? I am worried that without proper management, the 'lake' will just turn into a "data grave". One solution may be to create my own database in which I store the metadata for each file. Are there any other, more state-of-the-art approaches?
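To make the "own database" idea concrete, here is a minimal sketch of a self-managed metadata registry using SQLite. The schema and all field names are hypothetical illustrations, not any Azure API:

```python
import sqlite3

# Hypothetical, minimal metadata registry for lake files.
# Schema and field names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file_metadata (
        path        TEXT PRIMARY KEY,  -- full path of the file in the lake
        owner_team  TEXT,              -- identifiable owning team
        source      TEXT,              -- originating system
        format      TEXT,              -- e.g. csv, parquet, json
        ingested_at TEXT               -- ISO-8601 ingestion timestamp
    )
""")

conn.execute(
    "INSERT INTO file_metadata VALUES (?, ?, ?, ?, ?)",
    ("/raw/sales/2018/01/orders.csv", "sales-analytics",
     "erp-export", "csv", "2018-01-15T03:00:00Z"),
)

# Discovery query: everything owned by one team.
rows = conn.execute(
    "SELECT path, format FROM file_metadata WHERE owner_team = ?",
    ("sales-analytics",),
).fetchall()
print(rows)  # [('/raw/sales/2018/01/orders.csv', 'csv')]
```

The registry would have to be kept in sync with the lake yourself (e.g. updated by your ingestion jobs), which is exactly the management burden the question is about.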

halfer
AlexGuevara

2 Answers


This is a pretty broad question, which I will try my best to answer. In general, you organize data in the Data Lake Store by logical areas and identifiable owning teams. Data can be cataloged in Azure Data Catalog for discovery and enrichment. At present there is no automatic mechanism to publish data into the Data Catalog; the owners of the data have to publish it to ADC manually. If there are specific features in this area that are of interest, please submit and upvote them here: https://feedback.azure.com/forums/327234-data-lake
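The "logical areas and owning teams" layout can be sketched as a path convention. The zone names and path segments below are assumptions for illustration, not an Azure requirement:

```python
from pathlib import PurePosixPath

def lake_path(zone: str, team: str, dataset: str,
              date: str, filename: str) -> str:
    """Build a lake path of the (assumed) form
    /<zone>/<team>/<dataset>/<yyyy>/<mm>/<dd>/<file>."""
    allowed_zones = {"raw", "staged", "curated"}  # hypothetical zones
    if zone not in allowed_zones:
        raise ValueError(f"unknown zone: {zone}")
    yyyy, mm, dd = date.split("-")
    return str(PurePosixPath("/", zone, team, dataset, yyyy, mm, dd, filename))

print(lake_path("raw", "sales-analytics", "orders", "2018-01-15", "orders.csv"))
# /raw/sales-analytics/orders/2018/01/15/orders.csv
```

Enforcing such a convention in your ingestion code keeps the logical area and owning team recoverable from every path, which is what makes later cataloging feasible.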

Amit Kulkarni

Library classification should be considered a best-practice approach for ordering data in a data lake, because library classification systems order information / knowledge / data into disjoint categories.

Technically, you can encode disjoint category information in (file) names or (file) paths, or include it as header information or attributes within the files. In Azure, library classification approaches can additionally be applied by adding tags to data in the Azure Data Catalog.
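As a small illustration, category information embedded in a path following such a convention can be recovered programmatically. The `/<zone>/<domain>/<dataset>/<file>` convention here is an assumption, not a standard:

```python
def categories_from_path(path: str) -> dict:
    """Extract classification categories from a lake path, assuming
    the (hypothetical) convention /<zone>/<domain>/<dataset>/<file>."""
    parts = [p for p in path.split("/") if p]
    if len(parts) < 4:
        raise ValueError(f"path does not follow convention: {path}")
    return {
        "zone": parts[0],      # e.g. raw / curated
        "domain": parts[1],    # top-level classification category
        "dataset": parts[2],   # dataset within the domain
        "file": parts[-1],
    }

print(categories_from_path("/raw/finance/invoices/2018-01.csv"))
# {'zone': 'raw', 'domain': 'finance', 'dataset': 'invoices', 'file': '2018-01.csv'}
```

The same extracted categories could then be attached as tags in Azure Data Catalog, keeping the path convention and the catalog consistent.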

Franziska W.
  • 346
  • 2
  • 4