Questions tagged [apache-hudi]

Apache Hudi is a transactional data lake platform with a focus on batch and streaming processing (with ACID support). Use this tag for questions specific to Apache Hudi. Do not use this tag for general data lake or Delta Lake questions.

Questions on using Apache Hudi

158 questions
13 votes · 1 answer

lakeFS, Hudi, Delta Lake merge and merge conflicts

I'm reading the lakeFS documentation and don't yet clearly understand what a merge, or even a merge conflict, means in lakeFS terms. Let's say I use Apache Hudi for ACID support over a single table. I'd like to introduce multi-table ACID support…
alexanoid • 24,051 • 54 • 210 • 410
5 votes · 1 answer

java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html

I am trying to read data from Hudi but am getting the error below: Caused by: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html. I am able to read the data from Hudi…
radhika sharma • 499 • 1 • 9 • 28
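A common cause of this error is launching Spark without the Hudi bundle jar on the classpath. A minimal launch sketch, assuming Spark 3.1 with Scala 2.12 (the artifact coordinates and version are assumptions; adjust them to your cluster):

```shell
# Hypothetical session launch; versions shown are assumptions, not a recommendation.
pyspark \
  --packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.10.1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
```

With the bundle on the classpath, `spark.read.format("hudi").load(path)` should be able to resolve the `hudi` data source.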
4 votes · 1 answer

Duplicate record keys in Apache Hudi

Hudi does not seem to deduplicate records in some cases. Below is the configuration we use. We partition the data by customer_id, so our expectation is that Hudi will enforce uniqueness within the partition, i.e. each customer_id folder…
Mandar • 498 • 1 • 4 • 17
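By default, Hudi's simple/Bloom indexes enforce record-key uniqueness only within a partition path; uniqueness across all partitions requires a global index. A minimal sketch of the writer options involved, assuming placeholder column names (`order_id`, `customer_id`, `updated_at`):

```python
# Sketch of Hudi writer options (key names from the Hudi configuration docs);
# the column names are placeholders for this hypothetical table.
hudi_options = {
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "customer_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    # The default index dedupes only within a partition path;
    # a GLOBAL index enforces uniqueness across all partitions.
    "hoodie.index.type": "GLOBAL_BLOOM",
}
```

Note that a global index makes the key unique table-wide, at the cost of more expensive lookups on large tables.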
4 votes · 0 answers

Using the Apache Hudi library in Java clients

I am a Hudi newbie. I was wondering if the Hudi client libraries can be used directly from Java clients to write to Amazon S3 folders. I am trying to build a system that can store a large number of events, up to 50k/second, that will be emitted from a…
Code Junkie • 1,429 • 2 • 10 • 9
4 votes · 1 answer

Spark-Hudi: Save as table to Glue/Hive catalog

Scenario: store a Hudi Spark dataframe using the saveAsTable (DataFrameWriter) method, such that a Hudi-supported table with the org.apache.hudi.hadoop.HoodieParquetInputFormat input format and an automatically generated schema is created. Currently, saveAsTable works fine…
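A commonly reported workaround is to skip saveAsTable and instead let Hudi's Hive sync register the table (with HoodieParquetInputFormat) in the Glue/Hive catalog during a regular `format("hudi").save()`. A sketch of the sync options, with placeholder database and table names:

```python
# Sketch of Hudi Hive/Glue sync options (key names from Hudi's hive_sync config);
# the database/table/partition names are placeholders.
hive_sync_options = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "analytics",
    "hoodie.datasource.hive_sync.table": "orders_hudi",
    "hoodie.datasource.hive_sync.partition_fields": "dt",
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    # On EMR with the Glue Data Catalog as the metastore, syncing via the
    # metastore (rather than JDBC) registers the table for Hive/Athena.
    "hoodie.datasource.hive_sync.use_jdbc": "false",
}
```

These are passed alongside the usual write options on `df.write.format("hudi")`.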
3 votes · 2 answers

Example of CREATE TABLE on Trino using Hudi

I am using Spark Structured Streaming (3.1.1) to read data from Kafka, with Hudi (0.8.0) as the storage system on S3, partitioning the data by date (no problems with this section). I am looking to use Trino (355) to be able to query that data. As a…
gunj_desai • 782 • 6 • 19
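Before Trino shipped a dedicated Hudi connector, the usual route was to register the Hudi table in the Hive metastore so Trino's Hive connector can read it. A hypothetical Hive DDL sketch for a copy-on-write table (table name, columns, and location are placeholders):

```sql
-- Hypothetical Hive DDL; registering the table with HoodieParquetInputFormat
-- lets Trino's Hive connector read the copy-on-write view.
CREATE EXTERNAL TABLE events_hudi (
  _hoodie_commit_time string,
  event_id string,
  payload string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://bucket/path/events_hudi';
```

In practice, Hudi's Hive sync generates equivalent DDL automatically, which is usually less error-prone than writing it by hand.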
3 votes · 1 answer

Can you run a transactional data lake (Hudi, Delta Lake) with multiple EMR clusters

I’m looking into several “transactional data lake” technologies such as Apache Hudi, Delta Lake, and AWS Lake Formation Governed Tables. Except for the latter, I can’t see how these would work in a multi-cluster environment. I’m baselining against S3…
zachd1_618 • 4,210 • 6 • 34 • 47
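For Hudi specifically, version 0.8.0 added optimistic concurrency control for multi-writer setups, coordinated through an external lock provider. A sketch of the relevant settings (key names from Hudi's concurrency docs; the ZooKeeper address is a placeholder):

```python
# Sketch of Hudi multi-writer (OCC) settings; zk-host:2181 is a placeholder.
occ_options = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    # Lazy cleaning avoids one writer rolling back another's in-flight commit.
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host:2181",
}
```

With these, two EMR clusters writing to the same S3 table contend on the lock rather than corrupting the timeline; a real setup also needs the ZooKeeper port and base path configured.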
3 votes · 1 answer

Can the Hudi Metadata table be queried?

Going through the Hudi documentation I saw the Metadata Config section and was curious about how it is used. I created a table with the metadata enabled, and the directory got created under /.hoodie/metadata. Has anybody experimented with this feature?…
Oscar Drai • 141 • 1 • 7
3 votes · 1 answer

Unable to write a non-partitioned table using Apache Hudi

I'm using Apache Hudi to write a non-partitioned table to AWS S3 and sync it to Hive. Here are the DataSourceWriteOptions being used: val hudiOptions: Map[String, String] = Map[String, String]( DataSourceWriteOptions.TABLE_TYPE_OPT_KEY ->…
Bharat Kul Ratan • 985 • 2 • 12 • 24
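For non-partitioned tables, Hudi ships a dedicated key generator and a matching partition extractor for Hive sync. A minimal sketch of the two options usually involved (shown here as plain config keys rather than the Scala DataSourceWriteOptions constants):

```python
# Sketch: the pair of settings typically needed for non-partitioned Hudi tables.
nonpartitioned_options = {
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.NonPartitionedExtractor",
}
```

A frequent failure mode is setting only one of the two, so the writer and the Hive sync disagree about partitioning.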
3 votes · 2 answers

Apache Hudi Partitioning with custom format

I am currently doing a POC on Apache Hudi with Spark (Scala). I am facing a problem while saving a dataframe with partitioning. Hudi saves the dataframe with path/valueOfPartitionCol1/valueOfPartitionCol2… using the property…
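For partition paths derived from a timestamp column in a custom format, Hudi provides a TimestampBasedKeyGenerator whose input and output formats are configurable. A sketch, assuming a placeholder `event_ts` column (the `hoodie.deltastreamer.keygen.timebased.*` keys configure the key generator and, to my understanding, also apply to datasource writes):

```python
# Sketch: partitioning by a date string reformatted as yyyy/MM/dd folders;
# "event_ts" and the formats are assumptions for this example.
timestamp_partition_options = {
    "hoodie.datasource.write.partitionpath.field": "event_ts",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.TimestampBasedKeyGenerator",
    "hoodie.deltastreamer.keygen.timebased.timestamp.type": "DATE_STRING",
    "hoodie.deltastreamer.keygen.timebased.input.dateformat": "yyyy-MM-dd",
    "hoodie.deltastreamer.keygen.timebased.output.dateformat": "yyyy/MM/dd",
}
```

With the output format above, a row with event_ts 2021-06-01 would land under path/2021/06/01 instead of the raw column value.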
2 votes · 1 answer

How to provide TypeInformation for GenericRowData objects in Flink

I am using a deserializer to parse a Kafka stream (of JSON strings) and am then using the GenericRowData class to convert the ObjectNode to a RowData instance, which Hudi supports for writing directly from a DataStream. I'm expecting to…
Aman Vaishya • 179 • 12
2 votes · 1 answer

Hudi overwriting tables with back-dated data

I am pushing some initial bulk data into a Hudi table, and then every day I write incremental data into it. But if back-dated data arrives, the latest precombine field already in the table is ignored and the arriving precombine…
awadhesh14 • 89 • 7
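This matches the behavior of OverwriteWithLatestAvroPayload, which keeps the arriving record on update regardless of ordering; DefaultHoodieRecordPayload instead compares the precombine (ordering) value, so a back-dated record does not clobber a newer one. A toy Python simulation of that comparison, with illustrative field names:

```python
# Toy model of DefaultHoodieRecordPayload's update behavior: the record with
# the larger precombine/ordering value survives. Field names are illustrative.
def precombine(existing, arriving, field="updated_at"):
    """Return the version of a record that survives an upsert."""
    return arriving if arriving[field] >= existing[field] else existing

current = {"id": 1, "updated_at": 20230105, "amount": 100}
late    = {"id": 1, "updated_at": 20230101, "amount": 90}  # back-dated arrival

survivor = precombine(current, late)  # the newer row in the table wins
```

The commonly suggested fix is setting `hoodie.datasource.write.payload.class` to `org.apache.hudi.common.model.DefaultHoodieRecordPayload`; verify the behavior against your Hudi version.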
2 votes · 1 answer

Apache Hudi on Dataproc

Is there any guide to deploying Apache Hudi on a Dataproc cluster? I'm trying to deploy via the Hudi Quick Start Guide but I can't. Spark 3.1.1, Python 3.8.13, Debian 5.10.127 x86_64. Launch code: pyspark --jars…
2 votes · 1 answer

Pyspark streaming from Kafka to Hudi

I'm new to Hudi and I have a problem. I'm working on an EMR cluster in AWS with PySpark and Kafka, and what I want to do is read a topic from the Kafka cluster with PySpark Structured Streaming and then write it to S3 in Hudi format. To be honest, I've tried a lot…
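A sketch of the Hudi options such a streaming job typically hands to the writer; the table, key, partition, and ordering column names here are placeholders, not a verified pipeline:

```python
# Sketch of Hudi writer options for a Kafka -> S3 streaming sink;
# all field names are placeholders for this hypothetical topic.
stream_hudi_options = {
    "hoodie.table.name": "kafka_events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}
```

These can then be applied to the parsed stream, e.g. `df.writeStream.format("hudi").options(**stream_hudi_options).option("checkpointLocation", chk_path).start(s3_path)` (again a sketch; a checkpoint location is mandatory for streaming writes).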
2 votes · 0 answers

Hudi errors with 'DELETE is only supported with v2 tables.'

I'm trying out Hudi, Delta Lake, and Iceberg in the AWS Glue v3 engine (Spark 3.1) and have both Delta Lake and Iceberg running just fine end to end using a test pipeline I built with test data. Note I am not using any of the Glue Custom Connectors. I'm…
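Hudi's SQL DML (including DELETE) is routed through its own Spark session extension; without it, Spark falls back to the generic v2 path and raises this error. A sketch of the session settings commonly set for Hudi SQL support (0.9+), which in Glue would go into the job's Spark configuration:

```python
# Sketch of the Spark session settings Hudi's SQL DML support relies on.
spark_sql_confs = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions":
        "org.apache.hudi.spark.sql.HoodieSparkSessionExtension",
}
```

In Glue these are usually supplied via job parameters (e.g. `--conf`); the exact mechanism depends on how the job is launched.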