
I'm trying to read a Delta log file on a Databricks Community Edition cluster (DBR 7.2):

df = spark.range(100).toDF("id")
df.show()
df.repartition(1).write.mode("append").format("delta").save("/user/delta_test")

with open('/user/delta_test/_delta_log/00000000000000000000.json', 'r') as f:
  for l in f:
    print(l)

I'm getting a file-not-found error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<command-1759925981994211> in <module>
----> 1 with open('/user/delta_test/_delta_log/00000000000000000000.json','r')  as f:
      2   for l in f:
      3     print(l)

FileNotFoundError: [Errno 2] No such file or directory: '/user/delta_test/_delta_log/00000000000000000000.json'

I have tried adding /dbfs/ and dbfs:/ prefixes, but nothing worked; I'm still getting the same error.

with open('/dbfs/user/delta_test/_delta_log/00000000000000000000.json', 'r') as f:
  for l in f:
    print(l)

But using dbutils.fs.head I was able to read the file:

dbutils.fs.head("/user/delta_test/_delta_log/00000000000000000000.json")

'{"commitInfo":{"timestamp":1598224183331,"userId":"284520831744638","userName":"","operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"notebook":{"","isolationLevel":"WriteSerializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputBytes":"1171","numOutputRows":"100"}}}\n{"protocol":{"minReaderVersi...etc

How can we read/cat a DBFS file in Databricks with Python's open method?

ashley

1 Answer


By default, this data is on DBFS, and your code needs to understand how to access it. Python's open doesn't know about it - that's why it's failing.

But there is a workaround - DBFS is mounted to the cluster nodes at /dbfs, so you just need to prepend it to your file name: instead of /user/delta_test/_delta_log/00000000000000000000.json, use /dbfs/user/delta_test/_delta_log/00000000000000000000.json.
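
For example, a minimal sketch of reading the same log file through the /dbfs mount (same path as in the question):

# read the commit log through the /dbfs FUSE mount using plain Python file APIs
with open('/dbfs/user/delta_test/_delta_log/00000000000000000000.json', 'r') as f:
  for line in f:
    print(line)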

Update: on Community Edition, in DBR 7+, this mount is disabled. The workaround is to use the dbutils.fs.cp command to copy the file from DBFS to a local directory, such as /tmp or /var/tmp, and then read from it:

dbutils.fs.cp("/file_on_dbfs", "file:///tmp/local_file")

Please note that if you don't specify a URI scheme, the path refers to DBFS by default; to refer to a local file you need to use the file:// prefix (see docs).
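
Putting it together, a minimal sketch for the file from the question (the local /tmp target path is just an illustrative choice):

# copy the commit log from DBFS to the driver's local filesystem, then read it with open()
dbutils.fs.cp("/user/delta_test/_delta_log/00000000000000000000.json",
              "file:///tmp/00000000000000000000.json")

with open('/tmp/00000000000000000000.json', 'r') as f:
  for line in f:
    print(line)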

Alex Ott
  • Thanks @AlexOtt, I have tried using `/dbfs` and used python open file api.. `with open('/dbfs/user/delta_test/_delta_log/00000000000000000000.json','r') as f: for l in f: print(l)` still getting same error.. do I need to **manually** create `mount` point to access the file before using in python way? – ashley Sep 10 '20 at 01:26
  • 1
    It looks like that it depends on the DBR version on the community - it works just fine with DBR 6.6, but `/dbfs/` is empty on DBR 7.2 – Alex Ott Sep 10 '20 at 07:28
  • Hi Alex, could you elaborate how to workaround this issue through `dbutils.fs.cp`? I have tried the following: `dbutils.fs.cp("databricks-datasets/README.md", "/tmp/README.md")` --> worked `%fs ls /tmp/README.md` --> returns path to `"dbfs:/tmp/README.md"` `f = open("/tmp/README.md", "r")` --> `FileNotFoundError: [Errno 2] No such file or directory: '/tmp/README.md'` – Ying Xiong Sep 04 '21 at 05:07
  • 1
    I've added the code example to the answer. By default if you don't specify schema, then all references are going to the DBFS file. To use local files, use `file://` – Alex Ott Sep 04 '21 at 08:20
  • @AlexOtt, Hi Alex, this answer is super helpful. But I have one question. This `file:///` directory is not on DBFS, so where is this dir located? I searched my local machine, but no such dir is being created. I appreciate your help! – Vae Jiang Dec 24 '22 at 02:01
  • it will be located on the driver node – Alex Ott Dec 24 '22 at 08:52