I am currently setting up a data lake trying to follow the principles of Delta Lake (landing in bronze, cleaning and merging into silver, and then, if needed, presenting the final view in gold) and have a question about what should be stored in Silver.
For example, if the data in bronze comes in from a REST API and is stored in the JSON form it comes in in this format:
- id (Int)
- name (String)
- fields (Array of Strings)
An example looks like:
{
'id':12345,
'name':'Test',
'fields':['Hello','this','is','a','test']
}
In the end I want to present this as two tables. One would be the base table and look like:
TABLE 1
| id | name |
| -------- | -------------- |
| 12345 | Test |
And another would look like:
TABLE 2
| id | field_value |
| -------- | -------------- |
| 12345 | Hello |
| 12345 | this |
| 12345 | is |
| 12345 | a |
| 12345 | test |
My question is, should I pre-process the data in Spark and store the data in silver in separate folders like this:
-- root
---table 1
----file1.parquet
----etc.parquet
---table 2
----file1.parquet
----etc.parquet
Or store it all in silver under one folder and then derive those two tables using TSQL and functions like OPENJSON later?
Thank you for your help or insight!