I have an HDInsight cluster running Spark 1.6.2 and Jupyter. In a Jupyter notebook I run my PySpark commands, and some of the output is processed in a pandas DataFrame.

As the last step I would like to save out my pandas dataframe to a csv file and either:

  1. save it to the 'jupyter filesystem' and download it to my laptop
  2. save it to my blob storage

But I have no clue how to do that.

I tried the following for:

1. save it to the 'jupyter filesystem' and download it to my laptop

# df is my resulting dataframe, so I save it to the filesystem where jupyter runs
df.to_csv('app_keys.txt')

I was expecting it to save in the same directory as my notebook and thus to see it in the tree view in the browser. This is not the case. So my question is: Where is this file saved on the filesystem?
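A relative filename such as 'app_keys.txt' is resolved against the working directory of the process that executes the cell, which may differ from the folder shown in the notebook tree view (on a cluster, that process may even run on another node). A minimal sketch for finding out where the file actually lands; the small DataFrame is a hypothetical stand-in for the real `df`:

```python
import os

import pandas as pd

# Hypothetical stand-in for the real result DataFrame from the notebook.
df = pd.DataFrame({"app_key": ["a1", "b2"], "count": [3, 5]})

# A relative path is resolved against the working directory of the
# process running this code, not necessarily the notebook's folder.
relative_name = "app_keys.txt"
resolved_path = os.path.abspath(relative_name)

print("Working directory:", os.getcwd())
print("File will be written to:", resolved_path)

# Writing to the absolute path removes any ambiguity about the location.
df.to_csv(resolved_path, index=False)
print("Exists after write:", os.path.exists(resolved_path))
```

If the printed directory is not the one served by the Jupyter file browser, passing an explicit absolute path to `to_csv` is one way to control where the file ends up.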

2. save it to my blob storage

After googling, it seems I could also upload the file to blob storage using the azure.storage.blob module. So I tried:

# many examples online import BlockBlobService, but that class is not available in HDInsight
from azure.storage.blob import BlobService

# all variables in CAPITALS are defined earlier in my code
blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)

# check if reading from blob works
blob_service.get_blob_to_path(CONTAINERNAME, 'iris.txt', 'mylocalfile.txt')  # this works

# now try to reverse the process and write to blob
blob_service.create_blob_from_path(CONTAINERNAME, 'myblobfile.txt', 'mylocalfile.txt')
# fails with AttributeError: 'BlobService' object has no attribute 'create_blob_from_path'

or

blob_service.create_blob_from_text(CONTAINERNAME, 'myblobfile.txt', 'mylocalfile.txt')
# fails with AttributeError: 'BlobService' object has no attribute 'create_blob_from_text'

So I have no clue how to write the output from my pandas DataFrame to the filesystem and access it afterwards.

Any help is appreciated.
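One way to cope with differing SDK versions is to look up the upload method by name at runtime. The sketch below is not a definitive fix: `upload_file` is a hypothetical helper, and the name `put_block_blob_from_path` is an assumption about what the older 0.20 `BlobService` exposes, so it is worth confirming with `dir(blob_service)` on the cluster first.

```python
def upload_file(blob_service, container, blob_name, local_path):
    """Upload a local file with whichever method the client object offers.

    Returns the name of the method that was used.
    """
    # Newer name first, then the name assumed for the 0.20 SDK.
    candidates = ("create_blob_from_path", "put_block_blob_from_path")
    for method_name in candidates:
        method = getattr(blob_service, method_name, None)
        if method is not None:
            method(container, blob_name, local_path)
            return method_name
    raise AttributeError(
        "%s has none of the expected upload methods: %s"
        % (type(blob_service).__name__, ", ".join(candidates))
    )


# Usage on the cluster (not run here, needs real credentials):
# upload_file(blob_service, CONTAINERNAME, 'myblobfile.txt', 'mylocalfile.txt')
```

Listing `[m for m in dir(blob_service) if 'blob' in m]` shows which upload methods the installed version really provides.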

Geoffrey Stoel

1 Answer

In my experience, the second problem you encountered is caused by the version of the Azure Storage client library for Python. Older versions of the library do not include the methods you called in your code. The following question may be useful:

How to import Azure BlobService in python?
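To confirm which version of the library is installed, the package metadata can be queried from the notebook. This is a minimal sketch; `installed_version` is a hypothetical helper, and it assumes the library was installed under the distribution name `azure-storage`.

```python
def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        # Available on Python 3.8 and later.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Fallback for older interpreters that ship setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(dist_name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints None when the distribution is not installed.
print(installed_version("azure-storage"))
```

Comparing the printed version with the one used in the online examples shows whether the method names in those examples can be expected to exist.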

johnny
  • Thanks... I am indeed on 0.20, which is the default version on the Azure HDInsight cluster. What is the simplest way to upgrade this across the cluster? SSH into every machine and run pip install blob-storage --upgrade? – Geoffrey Stoel Oct 06 '16 at 08:29
  • The method you mentioned may be the simplest and most practical way. – johnny Oct 06 '16 at 08:41