
I'm trying to extract a tar.gz of around 5 GB (around 35 GB uncompressed) in our Databricks environment. I have tried to extract it with:

import tarfile

tar = tarfile.open(blob_storage_location, 'r:gz')
tar.extractall()
tar.close()

I also copied the file into our Databricks environment itself and tried it from there.

Also tried:

%sh
tar xvzf $(find /dbfs/tmp/ -name '*.tar.gz' -print ) -C /dbfs/tmp/

And:

import shutil
shutil.unpack_archive(path, path, 'gztar')

They all start and then keep hanging. It only works when I use our biggest default cluster, but I feel it should work on a smaller cluster as well (since it works on my laptop).

Differences between the clusters:

  • cluster 1
    • Worker Type:
      • 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
      • Min Workers 2 Max Workers 8
  • cluster 2
    • Worker Type:
      • 28.0 GB Memory, 4 Cores, 1 DBU Standard_DS3_v2
      • Workers 8

Any advice to get it working on the smaller one would be greatly appreciated.

Edit: I came back to this question and found the answer. You can create a custom cluster for this with just a single node. Then it works fine.
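For reference, a single-node cluster is essentially a cluster with zero workers and the single-node profile enabled. Roughly, the spec looks like the sketch below (field names are from memory of the Databricks Clusters API, so double-check them against the docs; the runtime version and node type are just placeholders):

# Rough single-node cluster spec (assumption: field names follow the Databricks Clusters API)
single_node_cluster = {
    "cluster_name": "single-node-extract",   # any name you like
    "spark_version": "7.3.x-scala2.12",      # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,                        # driver only, no workers
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}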

Arjen p
  • From [this answer](https://stackoverflow.com/a/61749827/2126910) it seems like Databricks can't handle extracting tar files. – philMarius Feb 18 '21 at 15:07

1 Answer


When you use %sh or any of the Python libraries, it doesn't matter how many workers you have - the work is done only on the driver node. I suspect the problem is that you have many files, and that unpacking the data directly onto DBFS is the bottleneck.

I would recommend unpacking the data onto the driver's local disk first, and then moving the unpacked files to DBFS.

mkdir -p /tmp/unpacked
tar xvzf /dbfs/..../file.tar.gz -C /tmp/unpacked

and then move:

dbutils.fs.mv("file:/tmp/unpacked", "dbfs:/tmp/", True)
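
If you want to stay in Python, the same idea looks roughly like this (just a sketch - it assumes you run it in a Databricks notebook where dbutils is available, and the paths are placeholders):

import tarfile

archive = "/dbfs/tmp/file.tar.gz"   # archive reached via the DBFS FUSE mount (placeholder path)
local_dir = "/tmp/unpacked"         # driver-local disk, much faster for many small files

# 1. Unpack onto the driver's local disk.
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(local_dir)

# 2. Move the unpacked tree to DBFS in one recursive operation.
dbutils.fs.mv("file:" + local_dir, "dbfs:/tmp/unpacked", True)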
Alex Ott