I am trying to write an R notebook in an Azure Databricks workspace to download data from a data lake. My code:

install.packages("reticulate")

%sh
pip install azure-storage-blob

library(reticulate)
py_run_string("
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta

account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'

blob_service = BaseBlobService(
    account_name=account_name,
    account_key=account_key
)

sas_token = blob_service.generate_container_shared_access_signature(container_name, permission=BlobPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))

blob_names = blob_service.list_blob_names(container_name, prefix = 'myfolder/')
blob_urls_with_sas = ['https://'+account_name+'.blob.core.windows.net/'+container_name+'/'+blob_name+'?'+sas_token for blob_name in blob_names]
")
blob_urls_with_sas <- py$blob_urls_with_sas

I got this error: Error in py_run_string_impl(code, local, convert): ModuleNotFoundError: No module named 'azure'

Any idea how to solve this problem?

UPDATE: I used the code that @RithwikBojja recommended:

library(reticulate)
py_run_string("
import subprocess
subprocess.check_call(['pip','install','azure-storage-blob'])
from datetime import datetime, timedelta
from azure.storage.blob import BlobServiceClient
from azure.storage.blob import ContainerClient

blob_storage_account = 'vamblob'
blob_storage_container = 'pool'
sas_token = 'sp=rM0zem1eSt5dEGrXdu2KUp9ROcvN7furA0pk%3D'
url = 'https://'+blob_storage_account+'.blob.core.windows.net/'+blob_storage_container
container_client = ContainerClient.from_container_url(
    container_url=url,
    credential=sas_token
)
y=container_client.list_blobs()
")

but I removed the final for loop from it, and then it runs without errors:

for x in y: print(x.name)

But now I don't know what to do next. I want to get the list of files in a given path and load them. I ran: py_list_attributes(py)

and I got:

"BlobServiceClient"      
"ContainerClient"       
"__annotations__"         
"__builtins__"           
"__doc__"      
"__file__"               
"__loader__"             
"__name__"     
"__package__"            
"__spec__"               
"abs_file"     
"base"                   
"bin_dir"               
"blob_storage_account"   
"blob_storage_container" 
"container_client"       
"datetime"               
"lib"               
"os"                     
"path"                   
"prev_length"  
"r"                      
"sas_token"              
"site"         
"subprocess"             
"sys"                    
"timedelta"    
"url"                    
"y"

When I run:

  • py$path I get /databricks/python3/lib/python3.9/site-packages
  • py$file I get /databricks/python3/bin/activate_this.py

What should I do next?

tomsu

1 Answer

I reproduced this in my environment and got the expected results. At first I also got a similar error.

Then I followed the process below. First, I ran this command:

%r
install.packages("reticulate")

Then I installed azure-storage-blob from inside the Python code itself by starting a subprocess, and modified your code following the Microsoft documentation:

library(reticulate)
py_run_string("
import subprocess

subprocess.check_call(['pip','install','azure-storage-blob'])
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient 
 
blob_service_client = BlobServiceClient.from_connection_string('XX')

container_name = 'pool'   

container_client = blob_service_client.get_container_client(container_name)

blob_list = container_client.list_blobs()
for blob in blob_list:
    print('\t' + blob.name)
")

XX is the connection string.
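
One caveat with the in-code install: a bare `pip` resolves to whichever pip is first on PATH, which is not necessarily the interpreter that reticulate embeds. That mismatch may be why the `%sh pip install` in the question had no effect, though I have not verified it on that cluster. A minimal sketch, assuming `sys.executable` points at the Python binary of your reticulate session (check with `py_config()` in R), is to call pip through that interpreter. The helper names `pip_install_command` and `ensure_package` are hypothetical, not part of any library. Strings use single quotes so the code can be pasted inside `py_run_string("...")`:

```python
import importlib.util
import subprocess
import sys


def pip_install_command(package):
    # Bind pip to the interpreter that is running this code.
    return [sys.executable, '-m', 'pip', 'install', package]


def ensure_package(module_name, package):
    # Returns True if an install was attempted,
    # False if the module could already be found.
    try:
        found = importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example 'azure') is missing.
        found = False
    if found:
        return False
    subprocess.check_call(pip_install_command(package))
    return True


# Usage (commented out so the sketch has no side effects):
# ensure_package('azure.storage.blob', 'azure-storage-blob')
```

This also skips the install when the package is already present, so re-running the cell does not call pip every time.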

I have not used BaseBlobService and BlobPermissions here because they are not present in the new version of the azure-storage-blob package.
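
The URL-building step from the question does not need those removed classes at all, since it is plain string handling. A rough sketch, assuming you already have a SAS token that grants read access to the blobs; `build_blob_urls` is a hypothetical helper and the account, container, blob name and token below are placeholders:

```python
from urllib.parse import quote


def build_blob_urls(account_name, container_name, blob_names, sas_token):
    # Accept tokens copied with or without the leading '?'.
    token = sas_token.lstrip('?')
    base = 'https://' + account_name + '.blob.core.windows.net/' + container_name
    # quote() percent-encodes spaces and similar characters but keeps '/',
    # so virtual folder paths such as 'myfolder/a.csv' stay intact.
    return [base + '/' + quote(name) + '?' + token for name in blob_names]


# Placeholder values for illustration only.
urls = build_blob_urls('myaccount', 'mycontainer', ['myfolder/a b.csv'], '?sv=placeholder')
print(urls[0])
```

When run through `py_run_string`, the resulting list should be readable from R as `py$urls`.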

Edit: Using SAS token

library(reticulate)
py_run_string("
import subprocess
subprocess.check_call(['pip','install','azure-storage-blob'])
from datetime import datetime, timedelta
from azure.storage.blob import BlobServiceClient
from azure.storage.blob import ContainerClient

blob_storage_account = 'vamblob'
blob_storage_container = 'pool'
sas_token = 'sp=rM0zem1eSt5dEGrXdu2KUp9ROcvN7furA0pk%3D'
url = 'https://'+blob_storage_account+'.blob.core.windows.net/'+blob_storage_container
container_client = ContainerClient.from_container_url(
    container_url=url,
    credential=sas_token
)
y=container_client.list_blobs()
for x in y:
 print(x.name)
")
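
To use the file list from R instead of only printing it, the names can be copied into a plain Python list. As far as I know, `list_blobs()` is lazy and sends no request until the result is iterated, so removing the loop only postpones any error rather than fixing it. Below is a sketch under those assumptions; `list_blob_names` and `download_blobs` are hypothetical helpers that expect the `container_client` created above, and `'myfolder/'` is a placeholder prefix:

```python
def list_blob_names(container_client, prefix=None):
    # Iterating forces the request; copy the names into a plain list
    # so they can be handed back to R.
    blobs = container_client.list_blobs(name_starts_with=prefix)
    return [blob.name for blob in blobs]


def download_blobs(container_client, names):
    # download_blob() returns a downloader object;
    # readall() reads the whole blob content as bytes.
    return {name: container_client.download_blob(name).readall() for name in names}


# Usage with the container_client created above:
# blob_names = list_blob_names(container_client, prefix='myfolder/')
# blob_data = download_blobs(container_client, blob_names)
```

After running this inside `py_run_string`, the names should be available in R as `py$blob_names`.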

RithwikBojja
  • thank you very much for your interest in the topic! The problem is that I don't have connection string, I have only Blob SAS Token. Do you know how I could use SAS Token in this case? – tomsu Dec 07 '22 at 12:12
  • @tomsu I have edited my answer according to your requirement. – RithwikBojja Dec 07 '22 at 14:01
  • Hi @RithwikBojja I tried it, and I got error: Error in py_run_string_impl(code, local, convert) : azure.core.exceptions.HttpResponseError: The requested URI does not represent any resource on the server. – tomsu Dec 08 '22 at 11:49
  • Have you entered the SAS token and everything else correctly, exactly as in my answer? It worked for me and I got the expected results, as shown in the image. – RithwikBojja Dec 08 '22 at 12:12
  • I edited my main post, please check – tomsu Dec 13 '22 at 15:34
  • Try using y.name. py_list_attributes(py) gives the attributes, not the list of names. I guess the indentation of your for loop might be wrong. See https://wiki.python.org/moin/ForLoop, or try using a while loop. – RithwikBojja Dec 13 '22 at 16:34