
I'm trying to extract a tar.gz of around 5 GB (around 35 GB uncompressed) in our Databricks environment. I have tried to extract it with:

import tarfile

tar = tarfile.open(blob_storage_location, 'r:gz')
tar.extractall()
tar.close()

I also copied the file into our Databricks environment itself and tried it from there.

Also tried:

%sh
tar xvzf $(find /dbfs/tmp/ -name '*.tar.gz' -print ) -C /dbfs/tmp/

And:

import shutil
shutil.unpack_archive(path, path, 'gztar')

They all start and then keep hanging. It only works when I use our biggest default cluster, but I feel it should work on a smaller cluster as well (since it works on my laptop).

Differences between the clusters:

  • cluster 1
    • Worker Type:
      • 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
      • Min Workers 2 Max Workers 8
  • cluster 2
    • Worker Type:
      • 28.0 GB Memory, 4 Cores, 1 DBU Standard_DS3_v2
      • Workers 8

Any advice to get it working on the smaller one would be greatly appreciated.

Edit: I came back to this question and found the answer. You can create a custom cluster for this with just a single node. Then it works fine.
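For reference, a single-node cluster is essentially a cluster with zero workers and the single-node profile enabled. Roughly, the spec looks like the sketch below (field names are from memory of the Databricks Clusters API, so double-check them against the docs; the runtime version and node type are just placeholders):

# Rough single-node cluster spec (assumption: field names follow the Databricks Clusters API)
single_node_cluster = {
    "cluster_name": "single-node-extract",   # any name you like
    "spark_version": "7.3.x-scala2.12",      # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,                        # driver only, no workers
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}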

Arjen p
  • From [this answer](https://stackoverflow.com/a/61749827/2126910) it seems like Databricks can't handle extracting tar files. – philMarius Feb 18 '21 at 15:07

1 Answer


When you use %sh or any of the Python libraries, it doesn't matter how many workers you have - the work is done only on the driver node. I suspect the problem is that you have many files, and that unpacking the data directly onto DBFS is the bottleneck.

I would recommend unpacking the data onto the driver's local disk first, and then moving the unpacked files to DBFS.

mkdir -p /tmp/unpacked
tar xvzf /dbfs/..../file.tar.gz -C /tmp/unpacked

and then move:

dbutils.fs.mv("file:/tmp/unpacked", "dbfs:/tmp/", True)
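
If you want to stay in Python, the same idea looks roughly like this (just a sketch - it assumes you run it in a Databricks notebook where dbutils is available, and the paths are placeholders):

import tarfile

archive = "/dbfs/tmp/file.tar.gz"   # archive reached via the DBFS FUSE mount (placeholder path)
local_dir = "/tmp/unpacked"         # driver-local disk, much faster for many small files

# 1. Unpack onto the driver's local disk.
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(local_dir)

# 2. Move the unpacked tree to DBFS in one recursive operation.
dbutils.fs.mv("file:" + local_dir, "dbfs:/tmp/unpacked", True)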
Alex Ott