
I have legacy data stored as CSV in an Azure DataLake Gen2 storage account. I'm able to connect to this and interrogate it using DataBricks. I have a requirement to remove certain records once their retention period expires, or if a GDPR "right to be forgotten" needs applying to the data.

Using Delta I can load a CSV into a Delta table and use SQL to locate and delete the required rows, but what is the best way to save these changes? Ideally back to the original file, so that the data is removed from the original. I've used the LOCATION option when creating the Delta table to persist the generated Parquet format files to the DataLake but it would be nice to keep it in the original CSV format.

Any advice appreciated.

3 Answers


I'd be careful here. Right to be forgotten means you need to delete the data. Delta doesn't actually delete it from the underlying files (initially at least) - that only happens once the table is vacuumed.

The safest way to delete the data is to read it all into a dataframe, filter out the records you do not want, and then write it back using overwrite mode. This ensures the data is removed and the same structure is re-written.

simon_dmorias
  • Thanks. I’m interested in you mentioning that Delta doesn’t delete it from the original file until you vacuum. I’ve tried vacuum with retain 0, but the original source CSV file remains in its original state. By original do you mean the file created when the Delta table was made, and not the source CSV? This bit has me confused as to how it works. I’ll have a play with dataframe.write again later - when I tried before I ended up with a folder full of files including a randomly named CSV. – ChrisCornwall May 25 '19 at 06:32
  • You can only delete from a delta file - not a csv. To remove an entry from a csv file you must rewrite it. – simon_dmorias May 25 '19 at 20:41
  • I've found a solution to creating a single CSV file with a non-GUID filename here: https://stackoverflow.com/a/50722167/985106, just changed it to point to the right place and use ADLS storage (abfss URI rather than s3a). I understand using coalesce will force all of the data onto one node but given the size of the files this should be ok. Will mark your answer as correct as you pointed me in the right direction - thanks! – ChrisCornwall May 27 '19 at 08:58

Convert Parquet to CSV in ADF

The versioned parquet files created in the ADLS Gen2 location can be converted to CSV using the Copy Data task in an Azure Data Factory pipeline.

So, you could read the CSV data into a Delta table (with its LOCATION pointing to a Data Lake folder), perform the required changes using SQL, and then convert the Parquet files to CSV format using ADF.

I have tried this and it works. The only hurdle might be detecting the column headers while reading the CSV file to Delta. You could read it to a dataframe and create a Delta table from it.

SA2010

If you are running the delete operations periodically, then it is costly to keep the data in CSV: every time you have to read the file, convert the dataframe to Delta, query it, and finally, after filtering the records, save it back to CSV and drop the Delta table.

So my suggestion here would be: convert the CSV to Delta once, perform the deletes periodically, and generate a CSV only when it's needed.

The advantage here is that Delta internally stores data in the Parquet format, a binary format that allows better compression and more efficient encoding/decoding of the data.