I have Data Dog log data archives streaming to an Azure Blob stored in a single 150MB JSON file compressed in a 15MB .gz file. These are being generated every 5 minutes. Need to do some analytics on this data. What is the most efficient and cost effective solution to get this data into delta lake?
From what I understand the driver that unpacks this data can only run on a single node spark cluster, which will take a very long time and cost a lot of DBU's.
Has anyone done this successfully without breaking the bank?