I'm trying to extract a ~5 GB tar.gz (about 35 GB uncompressed) in our Databricks environment. I have tried extracting it with:
import tarfile

tar = tarfile.open(blob_storage_location, 'r:gz')
tar.extractall()
tar.close()
I also copied the file into our Databricks environment first and tried the same thing there.
Also tried:
%sh
tar xvzf "$(find /dbfs/tmp/ -name '*.tar.gz' -print)" -C /dbfs/tmp/
And:
import shutil

shutil.unpack_archive(path, path, 'gztar')
They all start and then hang. It only works when I use our biggest default cluster, but I feel it should work on a smaller cluster as well (it works on my laptop).
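For reference, tarfile also has a streaming mode (`'r|gz'`) that reads the archive sequentially instead of seeking around in it, which might matter when the file lives on blob storage. A minimal sketch (the path is a placeholder for our mount; in stream mode you have to extract members as you iterate):

```python
import tarfile

# Placeholder path; the real archive sits on our blob storage mount.
archive_path = "/dbfs/tmp/archive.tar.gz"

# 'r|gz' opens the file as a sequential gzip stream (no random access),
# so members must be extracted in the order they are encountered.
with tarfile.open(archive_path, "r|gz") as tar:
    for member in tar:
        tar.extract(member, path="/dbfs/tmp/")
```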
The two clusters:
- Cluster 1
  - Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU (Standard_DS3_v2)
  - Min Workers 2, Max Workers 8
- Cluster 2
  - Worker Type: 28.0 GB Memory, 4 Cores, 1 DBU (Standard_DS3_v2)
  - Workers 8
Any advice to get it working on the smaller one would be greatly appreciated.
Edit: I came back to this question and found the answer: create a custom cluster for this with just a single node. Then it works fine.
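For anyone who prefers the REST API over the UI, here is a sketch of what such a single-node cluster spec can look like (host, token, runtime version and cluster name are placeholders; the `singleNode` profile settings and `num_workers: 0` are what make it a one-node cluster):

```python
import requests

# Placeholders for your workspace.
host = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "untar-single-node",
    "spark_version": "<runtime-version>",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,  # driver only, no workers
    # These settings mark the cluster as Single Node on Databricks.
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```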