
I have Datadog log archives streaming to Azure Blob Storage as single ~150 MB JSON files, each compressed into a ~15 MB .gz file. A new archive is generated every 5 minutes. I need to do some analytics on this data. What is the most efficient and cost-effective way to get it into Delta Lake?

From what I understand, the driver that unpacks this data can only run on a single-node Spark cluster, which will take a very long time and cost a lot of DBUs.

Has anyone done this successfully without breaking the bank?

1 Answer


From what I understand, the driver that unpacks this data can only run on a single-node Spark cluster, which will take a very long time and cost a lot of DBUs.

Yes, that's the big downside of the gzip format: it is not splittable, so decompression cannot be distributed across your workers and cores. A single process has to read each file in its entirety and decompress it in one pass, and on a single-node cluster that work falls on the driver. See this related question for more background.
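To make this concrete, here is a minimal PySpark sketch of reading one of these archives; the container, storage account, and path below are placeholders, and I'm assuming the usual Databricks notebook environment where storage access is already configured. Spark decompresses .gz transparently on read, but because gzip is not splittable, each archive is still handled by a single task.

```python
# Minimal sketch (placeholder path): read a gzipped Datadog archive from
# Azure Blob Storage. `spark` is the SparkSession that Databricks provides
# in every notebook.
archive_path = (
    "abfss://datadog-archives@<storage-account>.dfs.core.windows.net/"
    "logs/*.json.gz"
)

# Spark handles the .gz decompression transparently, but each gzipped file
# is processed by exactly one task, so a single 150 MB archive cannot be
# split across cores.
logs_df = spark.read.json(archive_path)
logs_df.printSchema()
```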

The only sensible workaround I've used myself is to give the driver a small number of cores, but make them as powerful as possible. Since you are using Azure Blob Storage, I assume you are running Databricks on Azure as well; here you can find all the Azure VM types, so just pick the one with the fastest cores.
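Once the archive is parsed, writing it out to Delta is the cheap part; a sketch continuing from the read above, with a placeholder table name:

```python
# Append the parsed logs to a Delta table ("datadog_logs" is a placeholder
# name for illustration).
(logs_df
    .write
    .format("delta")
    .mode("append")
    .saveAsTable("datadog_logs"))
```

Since the expensive step is the single-threaded decompression, high single-core performance on the driver matters more here than a large core count.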

Bartosz Gajda