Questions tagged [apache-hudi]

Apache Hudi is a transactional data lake platform with a focus on batch and streaming processing (with ACID support). Use this tag for questions specific to Apache Hudi. Do not use this tag for general data lake or Delta Lake questions.

Questions on using Apache Hudi

158 questions
13 votes · 1 answer

lakeFS, Hudi, Delta Lake merge and merge conflicts

I'm reading the lakeFS documentation and don't yet clearly understand what a merge, or even a merge conflict, means in lakeFS terms. Let's say I use Apache Hudi for ACID support over a single table. I'd like to introduce multi-table ACID support…
alexanoid • 24,051 • 54 • 210 • 410
5 votes · 1 answer

java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html

I am trying to read data from Hudi but am getting the error below: Caused by: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html. I am able to read the data from Hudi…
radhika sharma • 499 • 1 • 9 • 28
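A common cause of this error is launching Spark without the Hudi bundle jar on the classpath. A minimal launch sketch, assuming Spark 3.1 with Scala 2.12 (the artifact coordinates and version are assumptions; adjust them to your cluster):

```shell
# Hypothetical session launch; versions shown are assumptions, not a recommendation.
pyspark \
  --packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.10.1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
```

With the bundle on the classpath, `spark.read.format("hudi").load(path)` should be able to resolve the `hudi` data source.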
4 votes · 1 answer

Duplicate record keys in Apache Hudi

Hudi does not seem to deduplicate records in some cases. Below is the configuration we use. We partition the data by customer_id, so our expectation is that Hudi will enforce uniqueness within the partition, i.e. each customer_id folder…
Mandar • 498 • 1 • 4 • 17
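By default, Hudi's simple/Bloom indexes enforce record-key uniqueness only within a partition path; uniqueness across all partitions requires a global index. A minimal sketch of the writer options involved, assuming placeholder column names (`order_id`, `customer_id`, `updated_at`):

```python
# Sketch of Hudi writer options (key names from the Hudi configuration docs);
# the column names are placeholders for this hypothetical table.
hudi_options = {
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "customer_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    # The default index dedupes only within a partition path;
    # a GLOBAL index enforces uniqueness across all partitions.
    "hoodie.index.type": "GLOBAL_BLOOM",
}
```

Note that a global index makes the key unique table-wide, at the cost of more expensive lookups on large tables.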
4 votes · 0 answers

Using the Apache Hudi library in Java clients

I am a Hudi newbie. I was wondering if the Hudi client libraries can be used directly from Java clients to write to Amazon S3 folders. I am trying to build a system that can store a large number of events, up to 50k/second, that will be emitted from a…
Code Junkie • 1,429 • 2 • 10 • 9
4 votes · 1 answer

Spark-Hudi: Save as table to Glue/Hive catalog

Scenario: store a Hudi Spark dataframe using the saveAsTable (DataFrameWriter) method, such that a Hudi-supported table with the org.apache.hudi.hadoop.HoodieParquetInputFormat input format and an automatically generated schema is created. Currently, saveAsTable works fine…
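A commonly reported workaround is to skip saveAsTable and instead let Hudi's Hive sync register the table (with HoodieParquetInputFormat) in the Glue/Hive catalog during a regular `format("hudi").save()`. A sketch of the sync options, with placeholder database and table names:

```python
# Sketch of Hudi Hive/Glue sync options (key names from Hudi's hive_sync config);
# the database/table/partition names are placeholders.
hive_sync_options = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "analytics",
    "hoodie.datasource.hive_sync.table": "orders_hudi",
    "hoodie.datasource.hive_sync.partition_fields": "dt",
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    # On EMR with the Glue Data Catalog as the metastore, syncing via the
    # metastore (rather than JDBC) registers the table for Hive/Athena.
    "hoodie.datasource.hive_sync.use_jdbc": "false",
}
```

These are passed alongside the usual write options on `df.write.format("hudi")`.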
3 votes · 2 answers

Example of CREATE TABLE on Trino using Hudi

I am using Spark Structured Streaming (3.1.1) to read data from Kafka, with Hudi (0.8.0) as the storage system on S3, partitioning the data by date (no problems with this section). I am looking to use Trino (355) to be able to query that data. As a…
gunj_desai • 782 • 6 • 19
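Before Trino shipped a dedicated Hudi connector, the usual route was to register the Hudi table in the Hive metastore so Trino's Hive connector can read it. A hypothetical Hive DDL sketch for a copy-on-write table (table name, columns, and location are placeholders):

```sql
-- Hypothetical Hive DDL; registering the table with HoodieParquetInputFormat
-- lets Trino's Hive connector read the copy-on-write view.
CREATE EXTERNAL TABLE events_hudi (
  _hoodie_commit_time string,
  event_id string,
  payload string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://bucket/path/events_hudi';
```

In practice, Hudi's Hive sync generates equivalent DDL automatically, which is usually less error-prone than writing it by hand.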
3 votes · 1 answer

Can you run a transactional data lake (Hudi, Delta Lake) with multiple EMR clusters

I’m looking into several “transactional data lake” technologies such as Apache Hudi, Delta Lake, and AWS Lake Formation Governed Tables. Except for the latter, I can’t see how these would work in a multi-cluster environment. I’m baselining against S3…
zachd1_618 • 4,210 • 6 • 34 • 47
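For Hudi specifically, version 0.8.0 added optimistic concurrency control for multi-writer setups, coordinated through an external lock provider. A sketch of the relevant settings (key names from Hudi's concurrency docs; the ZooKeeper address is a placeholder):

```python
# Sketch of Hudi multi-writer (OCC) settings; zk-host:2181 is a placeholder.
occ_options = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    # Lazy cleaning avoids one writer rolling back another's in-flight commit.
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host:2181",
}
```

With these, two EMR clusters writing to the same S3 table contend on the lock rather than corrupting the timeline; a real setup also needs the ZooKeeper port and base path configured.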
3 votes · 1 answer

Can the Hudi Metadata table be queried?

Going through the Hudi documentation I saw the Metadata Config section and was curious about how it is used. I created a table with the metadata enabled, and the directory got created under /.hoodie/metadata. Has anybody experimented with this feature?…
Oscar Drai • 141 • 1 • 7
3 votes · 1 answer

Unable to write a non-partitioned table using Apache Hudi

I'm using Apache Hudi to write a non-partitioned table to AWS S3 and sync it to Hive. Here are the DataSourceWriteOptions being used: val hudiOptions: Map[String, String] = Map[String, String]( DataSourceWriteOptions.TABLE_TYPE_OPT_KEY ->…
Bharat Kul Ratan • 985 • 2 • 12 • 24
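For non-partitioned tables, Hudi ships a dedicated key generator and a matching partition extractor for Hive sync. A minimal sketch of the two options usually involved (shown here as plain config keys rather than the Scala DataSourceWriteOptions constants):

```python
# Sketch: the pair of settings typically needed for non-partitioned Hudi tables.
nonpartitioned_options = {
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.NonPartitionedExtractor",
}
```

A frequent failure mode is setting only one of the two, so the writer and the Hive sync disagree about partitioning.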
3 votes · 2 answers

Apache Hudi Partitioning with custom format

I am currently doing a POC on Apache Hudi with Spark (Scala). I am facing a problem while saving a dataframe with partitioning. Hudi saves the dataframe with path/valueOfPartitionCol1/valueOfPartitionCol2… using the property…
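For partition paths derived from a timestamp column in a custom format, Hudi provides a TimestampBasedKeyGenerator whose input and output formats are configurable. A sketch, assuming a placeholder `event_ts` column (the `hoodie.deltastreamer.keygen.timebased.*` keys configure the key generator and, to my understanding, also apply to datasource writes):

```python
# Sketch: partitioning by a date string reformatted as yyyy/MM/dd folders;
# "event_ts" and the formats are assumptions for this example.
timestamp_partition_options = {
    "hoodie.datasource.write.partitionpath.field": "event_ts",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.TimestampBasedKeyGenerator",
    "hoodie.deltastreamer.keygen.timebased.timestamp.type": "DATE_STRING",
    "hoodie.deltastreamer.keygen.timebased.input.dateformat": "yyyy-MM-dd",
    "hoodie.deltastreamer.keygen.timebased.output.dateformat": "yyyy/MM/dd",
}
```

With the output format above, a row with event_ts 2021-06-01 would land under path/2021/06/01 instead of the raw column value.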
2 votes · 1 answer

How to provide TypeInformation for GenericRowData objects in Flink

I am using a deserializer to parse a Kafka stream (of JSON strings) and am then using the GenericRowData class to convert the ObjectNode to a RowData instance, which Hudi supports for writing directly from a DataStream. I'm expecting to…
Aman Vaishya • 179 • 12
2 votes · 1 answer

Hudi overwriting tables with back-dated data

I am pushing some initial bulk data into a Hudi table, and then every day I write incremental data into it. But if back-dated data arrives, the latest precombine field already in the table is ignored and the arriving precombine…
awadhesh14 • 89 • 7
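This matches the behavior of OverwriteWithLatestAvroPayload, which keeps the arriving record on update regardless of ordering; DefaultHoodieRecordPayload instead compares the precombine (ordering) value, so a back-dated record does not clobber a newer one. A toy Python simulation of that comparison, with illustrative field names:

```python
# Toy model of DefaultHoodieRecordPayload's update behavior: the record with
# the larger precombine/ordering value survives. Field names are illustrative.
def precombine(existing, arriving, field="updated_at"):
    """Return the version of a record that survives an upsert."""
    return arriving if arriving[field] >= existing[field] else existing

current = {"id": 1, "updated_at": 20230105, "amount": 100}
late    = {"id": 1, "updated_at": 20230101, "amount": 90}  # back-dated arrival

survivor = precombine(current, late)  # the newer row in the table wins
```

The commonly suggested fix is setting `hoodie.datasource.write.payload.class` to `org.apache.hudi.common.model.DefaultHoodieRecordPayload`; verify the behavior against your Hudi version.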
2 votes · 1 answer

Apache Hudi on Dataproc

Is there any guide to deploying Apache Hudi on a Dataproc cluster? I'm trying to deploy via the Hudi Quick Start Guide but I can't. Spark 3.1.1, Python 3.8.13, Debian 5.10.127 x86_64. Launch code: pyspark --jars…
2 votes · 1 answer

Pyspark streaming from Kafka to Hudi

I'm new to Hudi and I have a problem. I'm working on an EMR cluster in AWS with PySpark and Kafka, and what I want to do is read a topic from the Kafka cluster with PySpark Structured Streaming and then write it to S3 in Hudi format. To be honest, I've tried a lot…
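A sketch of the Hudi options such a streaming job typically hands to the writer; the table, key, partition, and ordering column names here are placeholders, not a verified pipeline:

```python
# Sketch of Hudi writer options for a Kafka -> S3 streaming sink;
# all field names are placeholders for this hypothetical topic.
stream_hudi_options = {
    "hoodie.table.name": "kafka_events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}
```

These can then be applied to the parsed stream, e.g. `df.writeStream.format("hudi").options(**stream_hudi_options).option("checkpointLocation", chk_path).start(s3_path)` (again a sketch; a checkpoint location is mandatory for streaming writes).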
2 votes · 0 answers

Hudi errors with 'DELETE is only supported with v2 tables.'

I'm trying out Hudi, Delta Lake, and Iceberg in the AWS Glue v3 engine (Spark 3.1) and have both Delta Lake and Iceberg running just fine end to end using a test pipeline I built with test data. Note I am not using any of the Glue Custom Connectors. I'm…
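Hudi's SQL DML (including DELETE) is routed through its own Spark session extension; without it, Spark falls back to the generic v2 path and raises this error. A sketch of the session settings commonly set for Hudi SQL support (0.9+), which in Glue would go into the job's Spark configuration:

```python
# Sketch of the Spark session settings Hudi's SQL DML support relies on.
spark_sql_confs = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions":
        "org.apache.hudi.spark.sql.HoodieSparkSessionExtension",
}
```

In Glue these are usually supplied via job parameters (e.g. `--conf`); the exact mechanism depends on how the job is launched.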