
Right now, I have a DataFrame created using session.createDataFrame() in Snowpark Python. The intention is to append this DataFrame to an existing table in Snowflake.

However, the schema of the source DataFrame doesn't exactly match the schema of the target table. In Snowpark Scala, the DataFrameWriter object has an option() method (see "Saving/Appending a DataFrame to a Table" in the docs) that allows the column order to be specified, and hence allows columns to be skipped in the DataFrame, since columns can be matched by their names.

However, Snowpark Python lacks option() on DataFrameWriter at the moment. This forces Snowflake to require that the schemas and column counts of the source and target match exactly, otherwise an error is thrown.

Not sure when Snowpark for Python will receive this feature, but in the interim, is there any alternative (apart from hardcoding column names in an INSERT query)?
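
For illustration, a minimal sketch of the situation (connection parameters and table/column names are placeholders; TARGET_TABLE has more columns than the DataFrame):

from snowflake.snowpark import Session

# Placeholder connection parameters.
session = Session.builder.configs(connection_parameters).create()

# The source DataFrame covers only a subset of the target table's columns.
df = session.create_dataframe([(1, "a")], schema=["ID", "NAME"])

# TARGET_TABLE also has, say, a CREATED_AT column, so this fails because
# the schemas / column counts don't match.
df.write.mode("append").save_as_table("TARGET_TABLE")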

halfwind22

2 Answers


You are right that Snowpark does not make inserting novel records easy, but it is possible. I did it with the Snowpark Java SDK, which lacked any source/docs, just banging my head on the desk until it worked.

I first did a select against the target table (see the first line of the code below), then got its schema, then created a new Row object with the correct column order and types. Use the column "order" mode, not the column "name" mode. It's also really finicky about types: it doesn't like java.util.Dates but wants Timestamps, doesn't like Integers but needs Longs, etc.

Then do an "append" -> "saveAsTable". By some miracle it worked. Agreed, it would be fantastic if they accepted a Map<String, Object> to insert a row, or let you map columns by name, but they probably want to discourage this given how warehouses perform on row-based operations.

In Java...

import com.snowflake.snowpark_java.DataFrame;
import com.snowflake.snowpark_java.Row;
import com.snowflake.snowpark_java.types.StructType;

// Read one row just to pick up the target table's schema.
DataFrame dfSchema = session.sql("select * from TARGET_TABLE limit 1");
StructType schema = dfSchema.schema();
System.out.println(schema);
// Build the row values in the target table's column order, with the exact types the schema expects.
Row[] rows = new Row[]{Row.fromArray(new Object[]{endpoint.getDatabaseTable(), statusesArr, numRecords, Integer.valueOf(filenames.size()).longValue(), filenamesArr, urlsArr, startDate, endDate})};
DataFrame df = session.createDataFrame(rows, schema);
System.out.println(df.showString(0, 120));
// Append by column position.
df.write().mode("Append").saveAsTable("TARGET_TABLE");
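
For what it's worth, here is a rough, untested Snowpark Python sketch of the same approach; the row variables are placeholders mirroring the Java example above:

# Read one row just to pick up the target table's schema.
df_schema = session.sql("select * from TARGET_TABLE limit 1")
schema = df_schema.schema
print(schema)

# Build the values in exactly the order (and with the types) the schema expects.
row = (database_table, statuses, num_records, file_count, filenames, urls, start_ts, end_ts)

df = session.create_dataframe([row], schema=schema)
df.show()
df.write.mode("append").save_as_table("TARGET_TABLE")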
TomDurden
  • Well, this is what I mentioned too - the selectivity of columns is available in the Scala and Java APIs, but not in the Python one currently. However, I don't think this has anything to do with just row-level ops. The Delta Lake format in Databricks allows you to enter data for only certain columns, even in a bulk-load scenario. So, let's say you have table A with columns p, q, r and table B with columns p and q; you would be able to copy data from table B into table A (with column r having NULL values) easily. This feature is useful in a MERGE scenario too. – halfwind22 Oct 26 '22 at 08:08
  • I worked around this by writing an INSERT statement with the columns I need :( – halfwind22 Oct 26 '22 at 08:10
  • I'm taking the same direction as you. I'm switching to JDBC/ODBC and just do a regular INSERT or UPDATE statements. For the MERGE scenario, Snowpark just doesn't add much to the toolkit because of the limitations you mentioned. Writing SQL is still the way to go. – TomDurden Oct 26 '22 at 18:55
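
A minimal sketch of that INSERT workaround in Snowpark Python, as mentioned in the comments above (table and column names are placeholders; columns not listed, such as R, default to NULL):

# Only the columns present in the source table are listed explicitly.
session.sql("""
    insert into TABLE_A (P, Q)
    select P, Q from TABLE_B
""").collect()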

In the save_as_table method, use the parameter column_order="name" (see the Snowflake save_as_table docs). This matches the columns by name and lets you omit missing columns without the column-count mismatch error.
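
A minimal sketch, assuming an existing Snowpark Python session and a DataFrame df whose column names are a subset of the target table's columns (the table name is a placeholder):

df.write.mode("append").save_as_table("TARGET_TABLE", column_order="name")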

It's also good practice to include a schema when creating your DataFrame. See the Snowflake create_dataframe docs on using the StructType class.
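
For example, a minimal sketch of passing an explicit StructType schema to create_dataframe (connection parameters, column names, and types are placeholders):

from snowflake.snowpark import Session
from snowflake.snowpark.types import StructType, StructField, LongType, StringType

# Placeholder connection parameters.
session = Session.builder.configs(connection_parameters).create()

schema = StructType([
    StructField("ID", LongType()),
    StructField("NAME", StringType()),
])

df = session.create_dataframe([(1, "a"), (2, "b")], schema=schema)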

JContreras
  • Looks like Snowflake did a good documentation revamp, and also added the missing feature to the save_as_table API. – halfwind22 Feb 08 '23 at 06:21