
I have a Delta table stored in HDFS as a Hive table. I need to connect to the table and load its latest version. I was able to connect to HDFS using the pyarrow library, but it loads every version of the data. Here is my code:

import pyarrow as pa

# pa.hdfs.connect is the legacy HDFS API (deprecated in favour of
# pyarrow.fs.HadoopFileSystem in newer pyarrow versions)
hdfs = pa.hdfs.connect(host=ip, port=port)
dt = hdfs.read_parquet('/path/to/file/in/hdfs')
df = dt.to_pandas()

But this returns the entire historical data in the table. Is there an option to tell pyarrow that it is loading a Delta table?
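For context on why pyarrow returns everything: a Delta table's directory keeps parquet files from all versions, and the current snapshot is defined only by the JSON commits under `_delta_log`. Below is a minimal sketch (not part of the original question) of replaying that log to find the files belonging to the latest version. It assumes local paths for illustration; on HDFS you would list and open the same files through a pyarrow filesystem, and checkpoint parquet files are not handled:

```python
import json
import os

def live_files(delta_log_dir):
    """Replay the Delta _delta_log JSON commits in order and return the
    data file paths in the latest snapshot (adds minus removes).
    Checkpoints (*.checkpoint.parquet) are ignored in this sketch."""
    files = set()
    for name in sorted(f for f in os.listdir(delta_log_dir)
                       if f.endswith(".json")):
        with open(os.path.join(delta_log_dir, name)) as fh:
            for line in fh:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files
```

The returned paths (relative to the table root) could then be passed to `pyarrow.dataset.dataset(...)` so that only the current version's files are read.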

Another approach I tried is the deltalake library, but with it I was not able to connect to an HDFS location. Please check the code below:

from deltalake import DeltaTable
table_path_hdfs = "hdfs://ip:port/path/to/file/in/hdfs"
dt = DeltaTable(table_path_hdfs)

While running the code I am getting the error:

deltalake.PyDeltaTableError: Delta-rs must be build with feature 'hdfs' to support loading from: hdfs://

Is there a way to build delta-rs with HDFS support?
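For what it's worth, the feature name quoted in the error message suggests compiling the Python bindings from source with that cargo feature enabled. A rough build sketch, assuming a Rust toolchain and maturin are installed (the repository layout, flags, and wheel path are assumptions and may differ by delta-rs version):

```shell
git clone https://github.com/delta-io/delta-rs.git
cd delta-rs/python
# build the wheel with the 'hdfs' cargo feature enabled
maturin build --release --features hdfs
pip install ../target/wheels/deltalake-*.whl
```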

Can anybody suggest any other libraries for this?

Alex Ott
Josin Mathew
  • I'm confused. You want to read an HDFS file without having to build HDFS support in your library? – OneCricketeer Jul 01 '23 at 02:18
  • No basically I need to know if there is a way to read delta table stored in hdfs using python. In case delta-rs doesn't support hdfs is there any other libraries we can use? – Josin Mathew Jul 01 '23 at 18:20
  • Pyspark or Pyflink could be used. But delta-rs will work, you just need to compile with hdfs support first – OneCricketeer Jul 01 '23 at 20:11

0 Answers