2

I'm a trying to create a table with spark.catalog.createTable. It needs to have a partition column named "id".

Based on How can I use "spark.catalog.createTable" function to create a partitioned table? which is in Scala, I tried :

df = spark.range(10).withColumn("foo", F.lit("bar"))

spark.catalog.createTable("default.test_partition", schema=df.schema, **{"partitionColumnNames":"id"})

But it is not working. It creates a table in Hive with these properties :

CREATE TABLE default.test_partition (   id BIGINT,   foo STRING ) 
WITH SERDEPROPERTIES ('partitionColumnNames'='id' ...

The DDL of the table should actually be:

CREATE TABLE default.test_partition (   foo STRING ) 
PARTITIONED BY (   id BIGINT ) 
WITH SERDEPROPERTIES (...

The signature of the method is :

Signature: spark.catalog.createTable(tableName, path=None, source=None, schema=None, **options)

So, I believe there is a special argument in **options to create the partition, but I tried "partitionColumnNames", "partitionBy", "partition" ... none of them is working. Do you know what is the proper keyword for that ?


EDIT: If you are wondering why I want to use this method, there are 2 reasons :

  1. For personnal curiosity, to know what is doable or not
  2. I want to use "dynamic partitions" with spark, which requires that I use insertInto method (see Overwrite specific partitions in spark dataframe write method). But this method needs the table to be created first, action that I want to perform with the spark.catalog.createTable because it seems right.
Steven
  • 14,048
  • 6
  • 38
  • 73
  • I have not worked with spark.catalog yet but looking at the source code [here](https://github.com/apache/spark/blob/master/python/pyspark/sql/catalog.py) , looks like the `options` kwarg is only used when schema is not provided. `if schema is None: df = self._jcatalog.createTable(tableName, source, description, options)`. It doesnot look like they are using that kwarg for partitioning – anky Sep 28 '21 at 15:59
  • Also just wondering if you are trying to automate any ddl process? – anky Sep 28 '21 at 16:27
  • @anky I added an edit to explain the reason. Yep, that is some kind of automated ddl process. But I am open if you have some better ideas. – Steven Sep 28 '21 at 16:50
  • I see. I have added an answer based on what I understood, but that can be customized as per needs as well :) – anky Sep 28 '21 at 16:55

1 Answers1

1

Looking at the source code for spark.catalog here , it looks like that the keyword argument options is an alternative for schema and is only used when the schema parameter is not passed. This can be seen below:

"Optionally, a schema can be provided as the schema of the returned" #for options

if path is not None:
            options["path"] = path
            ...................
            ...........

        if schema is None: #this line and the line below
            df = self._jcatalog.createTable(tableName, source, description, options)
        else:
            if not isinstance(schema, StructType):
                raise TypeError("schema should be StructType")
            scala_datatype = self._jsparkSession.parseDataType(schema.json())
            df = self._jcatalog.createTable(
                tableName, source, scala_datatype, description, options)
        return DataFrame(df, self._sparkSession._wrapped)

However, if you are looking to create an automated DDL process, something along the lines of the below function might help you:

def mycreateTable(tablename,schema,partitioncols):
  schema_json = schema.json()
  ddlstring = (spark.sparkContext._jvm.org.apache.spark.sql.types.
                    DataType.fromJson(schema_json).toDDL())
  #if you dont want to DROP the table when it exists change the below line
  spark.sql(f"""DROP TABLE IF EXISTS {tablename} ;""") 

  spark.sql(f"""
  CREATE TABLE {tablename} ({ddlstring}) partitioned by ({','.join(partitioncols)}) """)

Now executing the below should work:

mycreateTable("default.test_partition",df.schema,['id'])

enter image description here

anky
  • 74,114
  • 11
  • 41
  • 70