
I'm following the Databricks example for uploading a file to DBFS (in my case .csv):

import json
import requests
import base64

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'
BASE_URL = 'https://%s/api/2.0/dbfs/' % (DOMAIN)

def dbfs_rpc(action, body):
  """ A helper function to make the DBFS API request, request/response is encoded/decoded as JSON """
  response = requests.post(
    BASE_URL + action,
    headers={'Authorization': 'Bearer %s' % TOKEN },
    json=body
  )
  return response.json()

# Create a handle that will be used to add blocks
handle = dbfs_rpc("create", {"path": "/temp/upload_large_file", "overwrite": "true"})['handle']
with open('/a/local/file') as f:
  while True:
    # A block can be at most 1MB
    block = f.read(1 << 20)
    if not block:
        break
    data = base64.standard_b64encode(block)
    dbfs_rpc("add-block", {"handle": handle, "data": data})
# close the handle to finish uploading
dbfs_rpc("close", {"handle": handle})

When using the tutorial as is, I get an error:

Traceback (most recent call last):
  File "db_api.py", line 65, in <module>
    data = base64.standard_b64encode(block)
  File "C:\Miniconda3\envs\dash_p36\lib\base64.py", line 95, in standard_b64encode
    return b64encode(s)
  File "C:\Miniconda3\envs\dash_p36\lib\base64.py", line 58, in b64encode
    encoded = binascii.b2a_base64(s, newline=False)
TypeError: a bytes-like object is required, not 'str'

I tried using `with open('./sample.csv', 'rb') as f:` before passing the blocks to `base64.standard_b64encode`, but then I get another error:

TypeError: Object of type 'bytes' is not JSON serializable

This happens when the encoded block data is being sent into the API call.
I also tried skipping encoding entirely and just passing the raw blocks into the post call. In that case the file gets created in DBFS but is 0 bytes in size.
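For reference, the second error is reproducible in isolation with just the standard library (the handle value here is a made-up placeholder):

```python
import base64
import json

# json.dumps (which requests uses for the json= body) rejects bytes values
data = base64.standard_b64encode(b"1,2,3\n")   # bytes, not str
try:
    json.dumps({"handle": 1234, "data": data})
    message = ""
except TypeError as err:
    message = str(err)
```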

At this point I'm trying to make sense of it all. It doesn't want a string but it doesn't want bytes either. What am I doing wrong? Appreciate any help.

Nik

1 Answer


In Python we have strings and bytes, which are two different entities. Note that there is no implicit conversion between them, so you need to know when to use which and how to convert when necessary. This answer gives a nice explanation.
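A quick illustration of the explicit conversions:

```python
# str <-> bytes conversion is always explicit in Python 3
text = "zażółć gęślą jaźń"        # str
raw = text.encode("utf-8")        # str -> bytes
roundtrip = raw.decode("utf-8")   # bytes -> str
```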

With the code snippet I see two issues:

  1. This one you already found: open by default reads the file as text, so your block is a string, while standard_b64encode expects bytes and returns bytes. To read bytes, the file needs to be opened in binary mode:
with open('/a/local/file', 'rb') as f:
  2. Only strings can be encoded as JSON. There's no source code available for dbfs_rpc (or I can't find it), but apparently it expects a string, which it internally encodes. Since your data is bytes, you need to convert it to a string explicitly, and that's done using decode:
dbfs_rpc("add-block", {"handle": handle, "data": data.decode('utf8')})
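Putting both fixes together, the upload loop would look like the sketch below. To keep it self-contained I write a small temporary file instead of your placeholder path, and collect the base64 payloads in a list rather than calling dbfs_rpc; the point is just the data types at each step:

```python
import base64
import json
import os
import tempfile

# Stand-in for the local CSV (your real path goes here instead)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"col1,col2\n1,2\n" * 1000)
    path = tmp.name

payloads = []
with open(path, 'rb') as f:          # fix 1: binary mode, so block is bytes
    while True:
        block = f.read(1 << 20)      # a block can be at most 1 MB
        if not block:
            break
        # fix 2: base64-encode the bytes, then decode to str for JSON
        data = base64.standard_b64encode(block).decode('utf8')
        json.dumps({"handle": 1234, "data": data})  # serializable now
        payloads.append(data)

os.unlink(path)
```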
Kombajn zbożowy
  • thanks for your answer but i'm afraid it's more complex than that. I am aware about the difference between the data types. This code snippet comes from the Databricks API examples [link](https://docs.databricks.com/dev-tools/api/latest/examples.html#upload-a-big-file-into-dbfs). `dbfs_rpc` is defined in the snippet itself. my problem is that even when i pass a string into JSON I end up with a 0 bytes file. in other cases I get errors related to data format. I assume this issue is Databricks specific – Nik Jun 13 '22 at 16:01
  • Aww... yeah, I didn't notice `dbfs_rpc`. Anyway, I tried this snippet (with my two updates) to upload a file and it did the work. – Kombajn zbożowy Jun 13 '22 at 17:12
  • thanks @Kombajn, i tried the code as is + your modes and it worked. I actually had one previous change that appears to be the problem: i was authenticating with `.netrc` file where i stored the creds (as databricks recommends) instead of hardcoding them in the code. Since the creds were stored in a file, I wasn't passing the headers into the API call. It does successfully create a handle but the resulting file in DBFS is 0 bytes. Now onto solving the mystery of how to authenticate gracefully haha – Nik Jun 13 '22 at 19:11
  • thanks again for the help. funny name btw! – Nik Jun 13 '22 at 19:12