When a dataframe is saved into a SQL database, Spark only writes to those columns of the table that are present in the dataframe. So if the ID
column is not part of the dataframe, Spark ignores it during the insert operation.
The insert statement is created in JdbcUtils.getInsertStatement(), and there only columns that are part of the underlying RDD are included in the insert statement:
val columns =
  [...]
  rddSchema.fields.map { col =>
    val normalizedName = tableColumnNames.find(f => columnNameEquality(f, col.name)).getOrElse {
      throw new AnalysisException(s"""Column "${col.name}" not found in schema $tableSchema""")
    }
    dialect.quoteIdentifier(normalizedName)
  }.mkString(",")
[...]
s"INSERT INTO $table ($columns) VALUES ($placeholders)"
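For intuition, here is a rough Python sketch of that logic (not Spark's actual implementation; the helper name and the simple case-insensitive matching are assumptions for illustration only):

def build_insert_statement(table, df_columns, table_columns):
    # Only columns present in the dataframe are considered; each one must
    # also exist in the target table, otherwise the write fails.
    normalized = []
    for col in df_columns:
        match = next((t for t in table_columns if t.lower() == col.lower()), None)
        if match is None:
            raise ValueError(f'Column "{col}" not found in table {table}')
        normalized.append(f'"{match}"')
    # One "?" placeholder per dataframe column.
    placeholders = ",".join("?" for _ in df_columns)
    return f'INSERT INTO {table} ({",".join(normalized)}) VALUES ({placeholders})'

# build_insert_statement("address",
#                        ["FirstName", "LastName", "CreOn", "CreBy"],
#                        ["id", "firstname", "lastname", "creon", "creby"])
# -> INSERT INTO address ("firstname","lastname","creon","creby") VALUES (?,?,?,?)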
For example, given the table definition
create table address (
  id serial,
  FirstName varchar(20),
  LastName varchar(20),
  CreOn timestamp,
  CreBy varchar(20),
  constraint pk primary key (id))
and the Python code
df = spark.createDataFrame(
    [("John", "Doe", "1970-01-02 03:46:40", "py2")],
    ['FirstName', 'LastName', 'CreOn', 'CreBy']
)
df.write.mode("append").jdbc(<jdbc url>, "address", \
    properties={"driver": ..., "user": ..., "password": ...})
Spark creates the insert statement
INSERT INTO address ("firstname","lastname","creon","creby") VALUES (?,?,?,?)
and the insert operation is successful.
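To double-check that the database fills in the serial id column itself, the table can be read back; the URL, driver class, and credentials below are assumed placeholders for whatever was used in the write:

url = "jdbc:postgresql://localhost:5432/mydb"      # assumed JDBC URL
props = {"driver": "org.postgresql.Driver",        # assumed driver class
         "user": "spark", "password": "secret"}    # assumed credentials

# Read the table back; the id values were assigned by the database, not by Spark.
spark.read.jdbc(url, "address", properties=props) \
     .select("id", "firstname", "lastname", "creon", "creby") \
     .show()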
So autogenerated columns should simply not be part of the dataframe, and the Permission denied error
is probably not related to the autogenerated field.
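If a dataframe does happen to contain the autogenerated column, a minimal sketch of excluding it before the write (reusing the assumed connection settings from above; the column name id comes from the table definition):

# Drop the autogenerated column so the database assigns its value itself;
# drop() silently ignores the column if it is not present.
df_without_id = df.drop("id")
df_without_id.write.mode("append").jdbc(url, "address", properties=props)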