How to unzip the zip file in the data assets of the Watson Data Platform?

from io import BytesIO
import zipfile

zip_ref = zipfile.ZipFile(BytesIO(streaming_body_1.read()), 'r')
zip_ref.extractall(WHICH DIRECTORY FOR THE DATA ASSETS)
zip_ref.close()

streaming_body_1 is the StreamingBody object for the zip file in the DATA ASSETS section. I uploaded the zip file to the DATA ASSETS.

How can I unzip the zip file in the Data Assets? I don't know the exact key path of the DATA ASSETS section, so I don't know which directory to pass to extractall.

I am trying to do this in the Jupyter notebook of the project.

Thank you!

dlsnfl37

1 Answer

When you upload a file to your project, it is stored in the project's assigned cloud storage, which should now be Cloud Object Storage by default (check your project settings). Uploaded files are just one type of data asset; there are others. To work with an uploaded file in a notebook, you first have to download it from the cloud storage so that it is accessible in the kernel's file system, and then perform the desired file operation (e.g. read or extract).

Assuming you've uploaded your ZIP file you should be able to generate code that reads the ZIP file using the tooling:

  • click the 1010 (Data icon) on the upper right hand side

  • select "Insert to code" > "Insert StreamingBody object"

  • consume the StreamingBody as desired

I ran a quick test and it worked like a charm:

...
# "Insert StreamingBody object" generated code
...
from io import BytesIO
import zipfile

zip_ref = zipfile.ZipFile(BytesIO(streaming_body_1.read()), 'r')
# List the archive members; print() works on both Python 2 and 3
print(zip_ref.namelist())
zip_ref.close()
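
To extract the files rather than just list them, any directory in the kernel's local file system can serve as the target. The following is a minimal sketch, not generated code: the helper name extract_zip_stream and the directory name "unzipped" are made up for illustration, and an in-memory ZIP stands in for streaming_body_1 so the snippet runs on its own. It assumes the streaming body offers read() like any file-like object.

```python
import os
import tempfile
import zipfile
from io import BytesIO


def extract_zip_stream(stream, target_dir):
    """Extract a ZIP archive read from a file-like object into target_dir.

    Hypothetical helper: `stream` is anything with a read() method,
    such as the streaming_body_1 object from the generated code.
    """
    os.makedirs(target_dir, exist_ok=True)
    # zipfile needs a seekable file, so buffer the whole stream in memory.
    with zipfile.ZipFile(BytesIO(stream.read()), "r") as zip_ref:
        zip_ref.extractall(target_dir)
        return zip_ref.namelist()


# Stand-in for streaming_body_1: a small ZIP archive built in memory.
_buf = BytesIO()
with zipfile.ZipFile(_buf, "w") as zf:
    zf.writestr("data/hello.txt", "hello")
fake_streaming_body = BytesIO(_buf.getvalue())

# Any writable local directory works; a temp directory is used here.
target = os.path.join(tempfile.mkdtemp(), "unzipped")
extracted_names = extract_zip_stream(fake_streaming_body, target)
print(extracted_names)
print(os.listdir(target))
```

In a notebook you would pass streaming_body_1 instead of the stand-in. The extracted files live only in the kernel's file system; they are not added back to the project's data assets.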

Edit 1: If your archive is a compressed tar file use the following code instead:

...
# "Insert StreamingBody object" generated code
...
import tarfile
from io import BytesIO
tf = tarfile.open(fileobj=BytesIO(streaming_body_1.read()), mode="r:gz")
tf.getnames()
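
If you are not sure whether an upload is a ZIP or a tar archive, the two snippets above can be combined. This is a rough sketch under my own assumptions: list_archive_members is a hypothetical helper, and it takes the raw bytes (for example streaming_body_1.read()) rather than the streaming body itself.

```python
import tarfile
import zipfile
from io import BytesIO


def list_archive_members(data):
    """Return member names of a ZIP or (optionally compressed) tar archive.

    Hypothetical helper: `data` is the raw archive content as bytes.
    """
    buf = BytesIO(data)
    if zipfile.is_zipfile(buf):
        buf.seek(0)
        with zipfile.ZipFile(buf, "r") as zf:
            return zf.namelist()
    buf.seek(0)
    # mode "r:*" lets tarfile detect the compression (gz, bz2, xz, none).
    with tarfile.open(fileobj=buf, mode="r:*") as tf:
        return tf.getnames()


# Build small sample archives in memory to exercise both branches.
zip_buf = BytesIO()
with zipfile.ZipFile(zip_buf, "w") as zf:
    zf.writestr("a.txt", "hello")

tar_buf = BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
    payload = b"hello"
    info = tarfile.TarInfo(name="b.txt")
    info.size = len(payload)
    tf.addfile(info, BytesIO(payload))

zip_members = list_archive_members(zip_buf.getvalue())
tar_members = list_archive_members(tar_buf.getvalue())
print(zip_members, tar_members)
```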

Edit 2: To avoid the read timeout you'll have to change the generated code from

config=Config(signature_version='oauth'),

to

config=Config(signature_version='oauth', connect_timeout=50, read_timeout=70),

With those changes in place I was able to download and extract training_data.tar.gz from the repo you've mentioned.
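
For large archives like this one, it may also help to copy the stream to a local temporary file in chunks instead of holding everything in a BytesIO. This does not replace the timeout change above; it only reduces memory use. The sketch below is untested against the actual service: extract_via_temp_file is a hypothetical helper, and it assumes the streaming body supports read(size) the way ordinary file objects do. An in-memory ZIP stands in for streaming_body_1.

```python
import os
import shutil
import tempfile
import zipfile
from io import BytesIO


def extract_via_temp_file(stream, target_dir, chunk_size=1024 * 1024):
    """Spool a ZIP stream to a temp file in chunks, then extract it.

    Hypothetical helper: `stream` must support read(size).
    """
    os.makedirs(target_dir, exist_ok=True)
    with tempfile.TemporaryFile() as tmp:
        # Copy chunk_size bytes at a time instead of one big read().
        shutil.copyfileobj(stream, tmp, chunk_size)
        tmp.seek(0)
        with zipfile.ZipFile(tmp, "r") as zf:
            zf.extractall(target_dir)
            return zf.namelist()


# Stand-in for streaming_body_1: a small ZIP archive built in memory.
_src = BytesIO()
with zipfile.ZipFile(_src, "w") as zf:
    zf.writestr("big/part1.txt", "x" * 1000)
_src.seek(0)

out_dir = os.path.join(tempfile.mkdtemp(), "extracted")
spooled_names = extract_via_temp_file(_src, out_dir, chunk_size=256)
print(spooled_names)
```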

ptitzler
  • Hi, This gives me the 'ReadTimeoutError: HTTPSConnectionPool(host='s3-api.us-geo.objectstorage.service.networklayer.com', port=443): Read timed out.' error. The file size is 525.6MB. – dlsnfl37 Jan 17 '18 at 01:48
  • I am trying to unzip the git file from https://github.com/adeshpande3/LSTM-Sentiment-Analysis :) but was not able to run due to the time out error :( – dlsnfl37 Jan 17 '18 at 02:18
  • @dlsnfl37 You need to use the tarfile package (https://stackoverflow.com/questions/30887979/i-want-to-create-a-script-for-unzip-tar-gz-file-via-python): tar = tarfile.open(fname, "r:gz"). For your code: tarfile.open(fileobj=BytesIO(streaming_body_1.read()), mode="r:gz") – charles gomes Jan 17 '18 at 17:35
  • Updated my answer based on @charles gomes input and your earlier comment – ptitzler Jan 17 '18 at 22:38
  • Thank you!! Yes the first code is absolutely working for the zipfile and your next code using the tarfile library is working perfectly. Much appreciated :) – dlsnfl37 Jan 21 '18 at 10:45