Questions tagged [apache-iceberg]

Apache Iceberg (often referred to simply as Iceberg) is a high-performance table format for large analytic datasets. It allows SQL tables to be consumed by analytics tools such as Apache Spark, Apache Flink, Apache Hive, Trino, PrestoDB, Impala, StarRocks, Doris, and Pig.

68 questions
3
votes
0 answers

Prevent executors from copying JARs in client mode

I am trying to run a Spark job on a Kubernetes cluster and I need to add additional JARs required by my application (specifically JARs for Apache Iceberg and AWS SDK). Initially, I tried running my Spark job through spark-submit in cluster deploy…
Shivaprasad
  • 399
  • 2
  • 5
  • 12
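One common way to avoid shipping local JARs from the client in this setup is to let Spark resolve the dependencies from Maven on every node via `spark.jars.packages`, instead of listing local file paths in `spark.jars`. A minimal sketch; the coordinates and version numbers below are assumptions and should be matched to your Spark/Scala/Iceberg versions:

```python
# Sketch: resolve Iceberg/AWS dependencies from Maven on each node instead of
# copying local JAR files. Coordinates and versions are assumptions.
ICEBERG_RUNTIME = "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3"
AWS_BUNDLE = "org.apache.iceberg:iceberg-aws-bundle:1.4.3"

spark_conf = {
    # Maven coordinates are resolved independently on the driver and on each
    # executor, so nothing needs to be uploaded from the client machine.
    "spark.jars.packages": ",".join([ICEBERG_RUNTIME, AWS_BUNDLE]),
}

# Equivalent spark-submit arguments:
submit_args = [f"--conf={k}={v}" for k, v in spark_conf.items()]
```

The same dict can be fed to `SparkSession.builder.config(k, v)` when building the session programmatically.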
3
votes
3 answers

How to enable storage-partitioned join in Spark/Iceberg?

How do I use the storage partitioned join feature in Spark 3.3.0? I've tried it out, and my query plan still shows the expensive ColumnarToRow and Exchange steps. My setup is as follows: joining two Iceberg tables, both partitioned on hours(ts),…
James D
  • 1,580
  • 1
  • 13
  • 9
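For context, storage-partitioned joins generally require several session properties to be set together. A hedged sketch of the configuration this feature typically needs with Spark 3.3 + Iceberg; verify the exact keys against your versions:

```python
# Sketch: session configs commonly needed for storage-partitioned joins
# (Spark 3.3 + Iceberg). Key names should be verified for your versions.
spj_conf = {
    # Enable Spark's v2 storage-partitioned join machinery.
    "spark.sql.sources.v2.bucketing.enabled": "true",
    # Let Iceberg report its partition grouping to Spark's planner.
    "spark.sql.iceberg.planning.preserve-data-grouping": "true",
    # Allow the join when the join keys are a superset of the partition keys.
    "spark.sql.requireAllClusterKeysForCoPartition": "false",
    # For testing: stop broadcast joins from masking the Exchange removal.
    "spark.sql.autoBroadcastJoinThreshold": "-1",
}
```

With these set, the plan for a join of two tables partitioned on `hours(ts)` should no longer contain an `Exchange` between the scans and the join.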
2
votes
0 answers

Performant writes to Apache Iceberg

I've spent 2+ weeks trying to achieve performant record writes from pandas (or ideally Polars, if possible) in a Python environment to our Apache Iceberg deployment (with Hive metastore), either directly or via the Trino query engine based…
Paul
  • 756
  • 1
  • 8
  • 22
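One direct option here is PyIceberg, which since version 0.6 can append an Arrow table straight to an Iceberg table registered in a Hive metastore, bypassing Trino. A hedged sketch; the catalog name, metastore URI, and table identifier are assumptions:

```python
# Sketch: catalog properties for a Hive-metastore-backed Iceberg catalog.
CATALOG_PROPS = {
    "type": "hive",
    "uri": "thrift://metastore-host:9083",  # assumption: your HMS endpoint
}

def append_frame(df, identifier="db.events"):
    """Sketch: append a pandas DataFrame to an Iceberg table via PyIceberg.

    Requires pyiceberg >= 0.6 (Arrow-based writes) and pyarrow installed.
    """
    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("hive_cat", **CATALOG_PROPS)
    table = catalog.load_table(identifier)
    # PyIceberg writes go through Arrow, so pandas is converted first.
    table.append(pa.Table.from_pandas(df))
```

For Polars, `pl.DataFrame.to_arrow()` produces the Arrow table directly, skipping the pandas conversion.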
2
votes
1 answer

Is there a way to remove files belonging to a partition without physically deleting them in Iceberg?

There is add_files() to add files from a Hive table to Iceberg, but I cannot find a way to reverse that operation other than dropping the table and recreating it. CALL spark_catalog.system.add_files( table => 'db.tbl', source_table =>…
Dyno Fu
  • 8,753
  • 4
  • 39
  • 64
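One workaround worth noting (not an exact inverse of `add_files`, so hedged accordingly): a DELETE whose predicate aligns with whole partitions is a metadata-only operation in Iceberg. The files are unlinked from the table but remain on storage until `expire_snapshots`/`remove_orphan_files` run. A sketch, assuming a table partitioned by a hypothetical `event_date` column:

```python
# Sketch: a partition-aligned DELETE only rewrites table metadata; the
# underlying data files stay on storage until snapshot expiry cleans them up.
drop_partition_sql = """
DELETE FROM db.tbl
WHERE event_date = DATE '2024-01-01'  -- assumption: partition column
"""
# spark.sql(drop_partition_sql)  # run inside your Spark session
```

Because the files were registered via `add_files` rather than written by Iceberg, be careful with orphan-file cleanup afterwards, since it could physically delete them.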
2
votes
0 answers

Apache Iceberg on Redshift Spectrum, is it possible?

I have seen here https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-redshift-spectrum-adds-support-for-querying-open-source-apache-hudi-and-delta-lake/ that Redshift Spectrum has support for Hudi and Delta. We're using Iceberg right now as a…
2
votes
0 answers

Partitions order when reading Iceberg table by Spark

I have a large partitioned Iceberg table ordered by some columns. Now I want to scan through some filtered parts of that table using Spark and toLocalIterator(), preserving the order. When my filter condition outputs the data from a single partition…
1
vote
1 answer

Write to Iceberg/Glue table from local PySpark session

I want to be able to operate (read/write) on an Iceberg table hosted on AWS Glue, from my local machine, using Python. I have already: Created an Iceberg table and registered it on AWS Glue Populated the Iceberg table with limited data using…
Luiz Tauffer
  • 463
  • 6
  • 17
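For reference, a local PySpark session talks to a Glue-backed Iceberg catalog through a handful of catalog properties. A sketch, where the catalog name (`glue`) and bucket are assumptions; it also presumes the Iceberg Spark runtime and AWS bundle JARs are on the classpath and AWS credentials are available locally (environment variables or a profile):

```python
# Sketch: Spark conf for an Iceberg catalog backed by AWS Glue.
# Catalog name and warehouse bucket are assumptions.
GLUE_CONF = {
    "spark.sql.catalog.glue": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue.warehouse": "s3://my-bucket/warehouse",  # assumption
}
# for k, v in GLUE_CONF.items(): builder = builder.config(k, v)
```

With this in place, the table is addressed as `glue.<database>.<table>` in Spark SQL.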
1
vote
1 answer

Creating an Iceberg Table on S3 Using PyIceberg and Glue Catalog

I am attempting to create an Iceberg Table on S3 using the Glue Catalog and the PyIceberg library. My goal is to define a schema, partitioning specifications, and then create a table using PyIceberg. However, despite multiple attempts, I haven't…
Lew
  • 11
  • 4
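For comparison, a minimal PyIceberg sketch of that flow: schema, partition spec, then `create_table` against the Glue catalog. The database, table, and bucket names are assumptions, and it requires PyIceberg installed with its AWS/Glue extras:

```python
def create_events_table():
    """Sketch: create a partitioned Iceberg table on S3 via the Glue catalog.

    Names and S3 location are assumptions; requires pyiceberg[glue].
    """
    from pyiceberg.catalog import load_catalog
    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.schema import Schema
    from pyiceberg.transforms import DayTransform
    from pyiceberg.types import NestedField, StringType, TimestampType

    catalog = load_catalog("glue", **{"type": "glue"})
    schema = Schema(
        NestedField(field_id=1, name="id", field_type=StringType(), required=True),
        NestedField(field_id=2, name="ts", field_type=TimestampType(), required=False),
    )
    # Partition on day(ts); source_id refers to the ts field above.
    spec = PartitionSpec(
        PartitionField(source_id=2, field_id=1000, transform=DayTransform(), name="ts_day")
    )
    return catalog.create_table(
        identifier="my_db.events",
        schema=schema,
        location="s3://my-bucket/warehouse/events",  # assumption
        partition_spec=spec,
    )
```

Field IDs in the schema must be unique, and the partition field's `source_id` must point at an existing schema field; mismatches there are a common cause of create-table failures.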
1
vote
1 answer

How to restart a Flink application from the last snapshot-id in DB streaming after the application was stopped

I'm creating an AWS Flink application in Java that streams from Iceberg, and I'm wondering whether Flink has a mechanism for restarting the stream from the last snapshot-id that was successfully processed if the whole application goes down.…
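Two hedged observations: Flink's own checkpoints/savepoints already persist the Iceberg source's reading position, so restoring from a savepoint is usually the simplest answer; and the Iceberg Flink source also exposes a `start-snapshot-id` read option, so the last processed snapshot id can be persisted externally and passed back on restart. A sketch of the read options (names per the Iceberg Flink connector docs; verify for your version):

```python
# Sketch: read options for the Iceberg Flink source when resuming a stream.
# Option names should be verified against your Iceberg/Flink versions.
last_processed = 5734257453453434734  # assumption: snapshot id you persisted

flink_read_options = {
    "streaming": "true",
    "monitor-interval": "30s",
    # Resume incremental reads starting after this snapshot.
    "start-snapshot-id": str(last_processed),
}
```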
1
vote
0 answers

Connecting Iceberg's JdbcCatalog to Spark session

I have a JdbcCatalog initialized with an H2 database in my local Java code. It is able to create Iceberg tables with the proper schema and partition spec. When I create a Spark session in the same class, it is unable to use the JdbcCatalog already created…
Ishan Das
  • 11
  • 3
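For reference, Spark attaches to a JDBC-backed Iceberg catalog through catalog properties, and the crucial part is that Spark must use the *same* JDBC URI and warehouse path as the Java code. A sketch; the catalog name and H2 URI are assumptions:

```python
# Sketch: Spark conf pointing at a JDBC-backed Iceberg catalog.
# Catalog name and H2 URI are assumptions; they must match the Java side.
JDBC_CATALOG_CONF = {
    "spark.sql.catalog.jdbc_cat": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.jdbc_cat.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
    "spark.sql.catalog.jdbc_cat.uri": "jdbc:h2:file:/tmp/iceberg_cat",  # assumption
    "spark.sql.catalog.jdbc_cat.warehouse": "file:///tmp/warehouse",
}
```

One likely pitfall: an H2 *in-memory* database (`jdbc:h2:mem:`) is visible only inside the JVM that created it, so a Spark session cannot see a catalog created that way elsewhere; a file- or server-mode H2 URI is needed for sharing.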
1
vote
0 answers

How to insert a comment in an Iceberg table?

I'm trying to put a comment on an Iceberg table in the Glue catalog, and I used it as follows: spark.sql(f"""CREATE EXTERNAL TABLE IF NOT EXISTS {schema_name}.{table_name}({columns}) USING iceberg COMMENT 'table…
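Two notes, hedged: in Spark SQL the `EXTERNAL` keyword is not used together with `USING iceberg`, and comments can also be attached after creation via standard Spark SQL. A sketch with placeholder catalog/table/column names:

```python
# Sketch: set a table-level comment and a column comment on an existing
# Iceberg table. Catalog, table, and column names are placeholders.
table_comment_sql = """
ALTER TABLE glue_catalog.my_db.my_table
SET TBLPROPERTIES ('comment' = 'my table description')
"""

column_comment_sql = """
ALTER TABLE glue_catalog.my_db.my_table
ALTER COLUMN id COMMENT 'primary identifier'
"""
# spark.sql(table_comment_sql); spark.sql(column_comment_sql)
```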
1
vote
0 answers

Iceberg with Hive Metastore does not create a catalog in Spark and uses default

I have been experiencing some (unexpected?) behavior where a catalog reference in Spark is not reflected in the Hive Metastore. I have followed the Spark configuration according to the documentation, which looks like it should create a new catalog…
thijsvdp
  • 404
  • 3
  • 16
1
vote
2 answers

How to use DBT with AWS Athena with Apache Iceberg tables

We have DBT models which we run on AWS Athena tables. It creates Hive external tables behind the scenes. Now we have a situation where the data type of a column may change in the future. Athena tables based on Hive do not allow changing data…
azaveri7
  • 793
  • 3
  • 18
  • 48
1
vote
1 answer

How to run VACUUM and OPTIMIZE SQL statements in Amazon Athena for Apache Iceberg v2 table

Based on this documentation page: https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-data-optimization.html For the following Apache Iceberg table: CREATE TABLE IF NOT EXISTS my_catalog.my_database.my_table ( id string, …
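From that documentation page, the two maintenance statements take this general shape; a sketch with placeholder names, and the `WHERE` predicate is an optional assumption to limit the rewrite:

```python
# Sketch: Athena maintenance statements for an Iceberg v2 table. In Athena
# SQL the table is referenced as database.table (no catalog prefix).
optimize_sql = """
OPTIMIZE my_database.my_table
REWRITE DATA USING BIN_PACK
WHERE category = 'c1'  -- assumption: optional predicate to limit the rewrite
"""

vacuum_sql = "VACUUM my_database.my_table"
```

`OPTIMIZE ... REWRITE DATA USING BIN_PACK` compacts small files, while `VACUUM` expires old snapshots and removes the files they referenced, governed by the table's retention properties.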
1
vote
1 answer

Retain latest record by differentiating "actual NULL in data" vs "NULL due to field not being present" in record using Trino

I have an Iceberg table like the one below and am trying to run a query using Trino to produce the expected output. Sample Data trino:datalakepartncr_trino> select src.ID, src.OUTSTANDING1, src.OUTSTANDING2, src.OUTSTANDINGP, src.LASTACTION,…
arunb2w
  • 1,196
  • 9
  • 28