Questions tagged [apache-iceberg]

Apache Iceberg (often referred to simply as Iceberg) is a high-performance table format for large analytic datasets. It allows SQL tables to be consumed by analytics tools such as Apache Spark, Apache Flink, Apache Hive, Trino, PrestoDB, Impala, StarRocks, Doris, and Pig.

68 questions
3
votes
0 answers

Prevent executors from copying JARs in client mode

I am trying to run a Spark job on a Kubernetes cluster and I need to add additional JARs required by my application (specifically JARs for Apache Iceberg and AWS SDK). Initially, I tried running my Spark job through spark-submit in cluster deploy…
Shivaprasad
  • 399
  • 2
  • 5
  • 12
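One common way to avoid shipping local JARs from the client in this setup is to let Spark resolve the dependencies from Maven on every node via `spark.jars.packages`, instead of listing local file paths in `spark.jars`. A minimal sketch; the coordinates and version numbers below are assumptions and should be matched to your Spark/Scala/Iceberg versions:

```python
# Sketch: resolve Iceberg/AWS dependencies from Maven on each node instead of
# copying local JAR files. Coordinates and versions are assumptions.
ICEBERG_RUNTIME = "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3"
AWS_BUNDLE = "org.apache.iceberg:iceberg-aws-bundle:1.4.3"

spark_conf = {
    # Maven coordinates are resolved independently on the driver and on each
    # executor, so nothing needs to be uploaded from the client machine.
    "spark.jars.packages": ",".join([ICEBERG_RUNTIME, AWS_BUNDLE]),
}

# Equivalent spark-submit arguments:
submit_args = [f"--conf={k}={v}" for k, v in spark_conf.items()]
```

The same dict can be fed to `SparkSession.builder.config(k, v)` when building the session programmatically.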
3
votes
3 answers

How to enable storage-partitioned join in Spark/Iceberg?

How do I use the storage partitioned join feature in Spark 3.3.0? I've tried it out, and my query plan still shows the expensive ColumnarToRow and Exchange steps. My setup is as follows: joining two Iceberg tables, both partitioned on hours(ts),…
James D
  • 1,580
  • 1
  • 13
  • 9
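For context, storage-partitioned joins generally require several session properties to be set together. A hedged sketch of the configuration this feature typically needs with Spark 3.3 + Iceberg; verify the exact keys against your versions:

```python
# Sketch: session configs commonly needed for storage-partitioned joins
# (Spark 3.3 + Iceberg). Key names should be verified for your versions.
spj_conf = {
    # Enable Spark's v2 storage-partitioned join machinery.
    "spark.sql.sources.v2.bucketing.enabled": "true",
    # Let Iceberg report its partition grouping to Spark's planner.
    "spark.sql.iceberg.planning.preserve-data-grouping": "true",
    # Allow the join when the join keys are a superset of the partition keys.
    "spark.sql.requireAllClusterKeysForCoPartition": "false",
    # For testing: stop broadcast joins from masking the Exchange removal.
    "spark.sql.autoBroadcastJoinThreshold": "-1",
}
```

With these set, the plan for a join of two tables partitioned on `hours(ts)` should no longer contain an `Exchange` between the scans and the join.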
2
votes
0 answers

Performant writes to Apache Iceberg

I've spent 2+ weeks trying to achieve performant record writes from pandas (or ideally Polars, if possible) in a Python environment to our Apache Iceberg deployment (with Hive metastore), either directly or via the Trino query engine based…
Paul
  • 756
  • 1
  • 8
  • 22
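One direct option here is PyIceberg, which since version 0.6 can append an Arrow table straight to an Iceberg table registered in a Hive metastore, bypassing Trino. A hedged sketch; the catalog name, metastore URI, and table identifier are assumptions:

```python
# Sketch: catalog properties for a Hive-metastore-backed Iceberg catalog.
CATALOG_PROPS = {
    "type": "hive",
    "uri": "thrift://metastore-host:9083",  # assumption: your HMS endpoint
}

def append_frame(df, identifier="db.events"):
    """Sketch: append a pandas DataFrame to an Iceberg table via PyIceberg.

    Requires pyiceberg >= 0.6 (Arrow-based writes) and pyarrow installed.
    """
    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("hive_cat", **CATALOG_PROPS)
    table = catalog.load_table(identifier)
    # PyIceberg writes go through Arrow, so pandas is converted first.
    table.append(pa.Table.from_pandas(df))
```

For Polars, `pl.DataFrame.to_arrow()` produces the Arrow table directly, skipping the pandas conversion.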
2
votes
1 answer

Is there a way to remove files belonging to a partition without physically deleting them in Iceberg?

There is add_files() to add files from a Hive table to Iceberg, but I cannot find a way to reverse that operation other than dropping the table and recreating it. CALL spark_catalog.system.add_files( table => 'db.tbl', source_table =>…
Dyno Fu
  • 8,753
  • 4
  • 39
  • 64
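One workaround worth noting (not an exact inverse of `add_files`, so hedged accordingly): a DELETE whose predicate aligns with whole partitions is a metadata-only operation in Iceberg. The files are unlinked from the table but remain on storage until `expire_snapshots`/`remove_orphan_files` run. A sketch, assuming a table partitioned by a hypothetical `event_date` column:

```python
# Sketch: a partition-aligned DELETE only rewrites table metadata; the
# underlying data files stay on storage until snapshot expiry cleans them up.
drop_partition_sql = """
DELETE FROM db.tbl
WHERE event_date = DATE '2024-01-01'  -- assumption: partition column
"""
# spark.sql(drop_partition_sql)  # run inside your Spark session
```

Because the files were registered via `add_files` rather than written by Iceberg, be careful with orphan-file cleanup afterwards, since it could physically delete them.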
2
votes
0 answers

Apache Iceberg on Redshift Spectrum, is it possible?

I have seen here https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-redshift-spectrum-adds-support-for-querying-open-source-apache-hudi-and-delta-lake/ that Redshift Spectrum has support for Hudi and Delta. We're using Iceberg right now as a…
2
votes
0 answers

Partitions order when reading Iceberg table by Spark

I have a large partitioned Iceberg table ordered by some columns. Now I want to scan through some filtered parts of that table using Spark and toLocalIterator(), preserving the order. When my filter condition outputs the data from a single partition…
1
vote
1 answer

Write to Iceberg/Glue table from local PySpark session

I want to be able to operate (read/write) on an Iceberg table hosted on AWS Glue, from my local machine, using Python. I have already: Created an Iceberg table and registered it on AWS Glue Populated the Iceberg table with limited data using…
Luiz Tauffer
  • 463
  • 6
  • 17
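For reference, a local PySpark session talks to a Glue-backed Iceberg catalog through a handful of catalog properties. A sketch, where the catalog name (`glue`) and bucket are assumptions; it also presumes the Iceberg Spark runtime and AWS bundle JARs are on the classpath and AWS credentials are available locally (environment variables or a profile):

```python
# Sketch: Spark conf for an Iceberg catalog backed by AWS Glue.
# Catalog name and warehouse bucket are assumptions.
GLUE_CONF = {
    "spark.sql.catalog.glue": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue.warehouse": "s3://my-bucket/warehouse",  # assumption
}
# for k, v in GLUE_CONF.items(): builder = builder.config(k, v)
```

With this in place, the table is addressed as `glue.<database>.<table>` in Spark SQL.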
1
vote
1 answer

Creating an Iceberg Table on S3 Using PyIceberg and Glue Catalog

I am attempting to create an Iceberg Table on S3 using the Glue Catalog and the PyIceberg library. My goal is to define a schema, partitioning specifications, and then create a table using PyIceberg. However, despite multiple attempts, I haven't…
Lew
  • 11
  • 4
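For comparison, a minimal PyIceberg sketch of that flow: schema, partition spec, then `create_table` against the Glue catalog. The database, table, and bucket names are assumptions, and it requires PyIceberg installed with its AWS/Glue extras:

```python
def create_events_table():
    """Sketch: create a partitioned Iceberg table on S3 via the Glue catalog.

    Names and S3 location are assumptions; requires pyiceberg[glue].
    """
    from pyiceberg.catalog import load_catalog
    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.schema import Schema
    from pyiceberg.transforms import DayTransform
    from pyiceberg.types import NestedField, StringType, TimestampType

    catalog = load_catalog("glue", **{"type": "glue"})
    schema = Schema(
        NestedField(field_id=1, name="id", field_type=StringType(), required=True),
        NestedField(field_id=2, name="ts", field_type=TimestampType(), required=False),
    )
    # Partition on day(ts); source_id refers to the ts field above.
    spec = PartitionSpec(
        PartitionField(source_id=2, field_id=1000, transform=DayTransform(), name="ts_day")
    )
    return catalog.create_table(
        identifier="my_db.events",
        schema=schema,
        location="s3://my-bucket/warehouse/events",  # assumption
        partition_spec=spec,
    )
```

Field IDs in the schema must be unique, and the partition field's `source_id` must point at an existing schema field; mismatches there are a common cause of create-table failures.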
1
vote
1 answer

How to restart a Flink application from the last snapshot-id in DB streaming after the application was stopped

I'm creating an AWS Flink application in Java that streams from Iceberg, and I'm wondering whether Flink has a mechanism for restarting the stream from the last snapshot-id that was successfully processed if the whole application goes down.…
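Two hedged observations: Flink's own checkpoints/savepoints already persist the Iceberg source's reading position, so restoring from a savepoint is usually the simplest answer; and the Iceberg Flink source also exposes a `start-snapshot-id` read option, so the last processed snapshot id can be persisted externally and passed back on restart. A sketch of the read options (names per the Iceberg Flink connector docs; verify for your version):

```python
# Sketch: read options for the Iceberg Flink source when resuming a stream.
# Option names should be verified against your Iceberg/Flink versions.
last_processed = 5734257453453434734  # assumption: snapshot id you persisted

flink_read_options = {
    "streaming": "true",
    "monitor-interval": "30s",
    # Resume incremental reads starting after this snapshot.
    "start-snapshot-id": str(last_processed),
}
```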
1
vote
0 answers

Connecting Iceberg's JdbcCatalog to Spark session

I have a JdbcCatalog initialized with an H2 database in my local Java code. It is able to create Iceberg tables with the proper schema and partition spec. When I create a Spark session in the same class, it is unable to use the JdbcCatalog already created…
Ishan Das
  • 11
  • 3
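For reference, Spark attaches to a JDBC-backed Iceberg catalog through catalog properties, and the crucial part is that Spark must use the *same* JDBC URI and warehouse path as the Java code. A sketch; the catalog name and H2 URI are assumptions:

```python
# Sketch: Spark conf pointing at a JDBC-backed Iceberg catalog.
# Catalog name and H2 URI are assumptions; they must match the Java side.
JDBC_CATALOG_CONF = {
    "spark.sql.catalog.jdbc_cat": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.jdbc_cat.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
    "spark.sql.catalog.jdbc_cat.uri": "jdbc:h2:file:/tmp/iceberg_cat",  # assumption
    "spark.sql.catalog.jdbc_cat.warehouse": "file:///tmp/warehouse",
}
```

One likely pitfall: an H2 *in-memory* database (`jdbc:h2:mem:`) is visible only inside the JVM that created it, so a Spark session cannot see a catalog created that way elsewhere; a file- or server-mode H2 URI is needed for sharing.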
1
vote
0 answers

How to insert a comment in an Iceberg table?

I'm trying to put a comment on an Iceberg table in the Glue catalog, and I used it as follows: spark.sql(f"""CREATE EXTERNAL TABLE IF NOT EXISTS {schema_name}.{table_name}({columns}) USING iceberg COMMENT 'table…
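Two notes, hedged: in Spark SQL the `EXTERNAL` keyword is not used together with `USING iceberg`, and comments can also be attached after creation via standard Spark SQL. A sketch with placeholder catalog/table/column names:

```python
# Sketch: set a table-level comment and a column comment on an existing
# Iceberg table. Catalog, table, and column names are placeholders.
table_comment_sql = """
ALTER TABLE glue_catalog.my_db.my_table
SET TBLPROPERTIES ('comment' = 'my table description')
"""

column_comment_sql = """
ALTER TABLE glue_catalog.my_db.my_table
ALTER COLUMN id COMMENT 'primary identifier'
"""
# spark.sql(table_comment_sql); spark.sql(column_comment_sql)
```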
1
vote
0 answers

Iceberg with Hive Metastore does not create a catalog in Spark and uses default

I have been experiencing some (unexpected?) behavior where a catalog reference in Spark is not reflected in the Hive Metastore. I have followed the Spark configuration according to the documentation, which looks like it should create a new catalog…
thijsvdp
  • 404
  • 3
  • 16
1
vote
2 answers

How to use DBT with AWS Athena with Apache Iceberg tables

We have DBT models which we run on AWS Athena tables. It creates Hive external tables behind the scenes. Now we have a situation where the data type of a column may change in the future. Athena tables based on Hive do not allow changing data…
azaveri7
  • 793
  • 3
  • 18
  • 48
1
vote
1 answer

How to run VACUUM and OPTIMIZE SQL statements in Amazon Athena for Apache Iceberg v2 table

Based on this documentation page: https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-data-optimization.html For the following Apache Iceberg table: CREATE TABLE IF NOT EXISTS my_catalog.my_database.my_table ( id string, …
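From that documentation page, the two maintenance statements take this general shape; a sketch with placeholder names, and the `WHERE` predicate is an optional assumption to limit the rewrite:

```python
# Sketch: Athena maintenance statements for an Iceberg v2 table. In Athena
# SQL the table is referenced as database.table (no catalog prefix).
optimize_sql = """
OPTIMIZE my_database.my_table
REWRITE DATA USING BIN_PACK
WHERE category = 'c1'  -- assumption: optional predicate to limit the rewrite
"""

vacuum_sql = "VACUUM my_database.my_table"
```

`OPTIMIZE ... REWRITE DATA USING BIN_PACK` compacts small files, while `VACUUM` expires old snapshots and removes the files they referenced, governed by the table's retention properties.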
1
vote
1 answer

Retain latest record by differentiating "actual NULL in data" vs "NULL due to field not being present" in record using Trino

I have an Iceberg table like the one below and am trying to run a query using Trino to produce the expected output. Sample Data trino:datalakepartncr_trino> select src.ID, src.OUTSTANDING1, src.OUTSTANDING2, src.OUTSTANDINGP, src.LASTACTION,…
arunb2w
  • 1,196
  • 9
  • 28