
Please share your experiences with unloading data from Snowflake. The table has a million rows, and each row is around 16 MB of data.

The statement "COPY INTO '@ext_stg/path/file_name' FROM schema.table" has to generate a separate file for each row. The intent is to generate a million files in S3.

COPY INTO is designed to write bulk data at once, so using it to generate a separate file for each row is extremely slow.

Thanks!

sim
  • How are you exporting 1 record at a time through COPY INTO? Can you include your statement in your question? This will help other users in answering your question more accurately. – Mike Walton May 20 '20 at 01:55

2 Answers


Snowflake's COPY INTO <location> statement can write in the ndjson format, which already makes it very simple to split the records apart with a little local processing.

It appears you've already tried doing a row-by-row iteration to perform such single row exports and have found it expectedly slow. It may still be a viable option if this is only a one-time operation.

Snowflake does not offer any parallel, per-row export technique (that I am aware of), so it may be simpler instead to export the entire table normally, and then use a downstream parallel processing framework (such as a Spark job) to divide the output into individual record files. The ndjson format's one-record-per-line nature makes the file easy to split in distributed processing frameworks.
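As a minimal sketch of the downstream-split approach: since each line of an ndjson file is a complete JSON document, a few lines of Python can fan a single exported file out into per-record files. The input and output names here are hypothetical, and a Spark job would follow the same line-by-line pattern at scale.

```python
import os

def split_ndjson(input_path, output_dir):
    """Split an ndjson export into one file per record.

    Each line of an ndjson file is a complete JSON document,
    so no JSON parsing is needed to divide the records.
    """
    os.makedirs(output_dir, exist_ok=True)
    count = 0
    with open(input_path, "r", encoding="utf-8") as src:
        for idx, line in enumerate(src):
            line = line.strip()
            if not line:  # skip blank lines
                continue
            out_path = os.path.join(output_dir, f"record_{idx}.json")
            with open(out_path, "w", encoding="utf-8") as out:
                out.write(line)
            count += 1
    return count

# Example: split_ndjson("export.ndjson", "records/") writes
# records/record_0.json, records/record_1.json, ...
```

For a million rows you would run this in parallel over the multiple files COPY INTO produces, rather than over one giant file.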

P.S. Specifying the MAX_FILE_SIZE copy option with a very low value (lower than your minimum row size) will not guarantee a single file per row, as the writes are done over sets of rows read together from the table.


You can achieve this through scripting in Python, or with a Snowflake JavaScript stored procedure.

Pseudocode would look like this:

var_filter_list = select primary_key from schema.table;   -- primary key or unique identifier

for idx, pk_val in enumerate(var_filter_list):            -- one COPY INTO per row
  var_file_name = concat(file_name, idx)
  copy into @ext_stg/path/var_file_name
    from (select * from schema.table where primary_key = pk_val);
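The pseudocode above can be sketched in Python. Assuming the primary keys have already been fetched (e.g. with snowflake-connector-python via cur.execute("SELECT primary_key FROM schema.table")), the helper below only builds the per-row COPY INTO statements; the table, stage path, and file prefix are the placeholders from the question, and you would execute each statement through your connection's cursor.

```python
def build_per_row_copy_statements(pk_values, table="schema.table",
                                  stage_path="@ext_stg/path",
                                  file_prefix="file_name"):
    """Build one COPY INTO statement per primary-key value.

    Each statement exports exactly one row to its own file in the
    external stage. Note: numeric keys are assumed here; string keys
    would need quoting/escaping, or better, a bind parameter.
    """
    statements = []
    for idx, pk in enumerate(pk_values):
        statements.append(
            f"COPY INTO {stage_path}/{file_prefix}_{idx} "
            f"FROM (SELECT * FROM {table} WHERE primary_key = {pk}) "
            f"OVERWRITE = TRUE SINGLE = TRUE"
        )
    return statements

# Usage sketch (connection setup omitted):
#   for stmt in build_per_row_copy_statements(pk_list):
#       cur.execute(stmt)
```

As the other answer notes, issuing a million sequential COPY INTO statements this way will be very slow; this approach is only reasonable for small tables or a one-time job.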