
I'm working with Snowflake and some JSON files that I need to upload into a staging area. Since Snowflake doesn't allow files bigger than 1GB, I had to split them into smaller files using 7zip.

I ended up with 4 files, yelp_academic_dataset_user.json.001.gz through yelp_academic_dataset_user.json.004.gz, which I uploaded to the staging area.
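(For reference, split files like these can be uploaded to an internal stage with SnowSQL's PUT command; the local path below is just a placeholder.)

-- upload the split files from the local machine to the internal stage;
-- AUTO_COMPRESS=FALSE because the files are already gzipped
PUT file:///tmp/yelp_academic_dataset_user.json.*.gz @staging
  AUTO_COMPRESS = FALSE;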

I'm trying to copy those files from the staging area into a table using the following command:

copy into yelp_user
from @staging/yelp_academic_dataset_user.json.gz
file_format = (format_name = yelp_user)
on_error = 'skip_file';

Which gets me this error:

002019 (0A000): SQL compilation error:
JSON file format can produce one and only one column of type variant or object or array. Use CSV file format if you want to load more than one column.

Then I tried creating a table with a single VARIANT column instead, since the JSON file format can only load into one column:

CREATE OR REPLACE TABLE json_table_user(json_data variant);

copy into JSON_TABLE_USER
file_format = (format_name = 'yelp_user')
files = ('yelp_academic_dataset_user.json.001.gz',
         'yelp_academic_dataset_user.json.002.gz',
         'yelp_academic_dataset_user.json.003.gz',
         'yelp_academic_dataset_user.json.004.gz')
on_error = 'skip_file';

And I get an error saying:

Remote file 'https://gcpuscentral1-ncxq405-stage.storage.googleapis.com/tables/2807681033/yelp_academic_dataset_user.json.004.gz' was not found. There are several potential causes. The file might not exist. The required credentials may be missing or invalid. If you are running a copy command, please make sure files are not deleted when they are being loaded or files are not being loaded into two different tables concurrently with auto purge option.

This is driving me crazy, as following the tutorials on Snowflake's website hasn't helped me.

Does anyone know how I can copy those split files into a table the way I need to?

ilikawa
  • Why do you think Snowflake can't load files > 1gb? In the last COPY statement you give, where is the required FROM clause? – NickW Apr 25 '21 at 06:47
  • Because I tried to do that on the web client and got the error about the maximum size allowed. As for the other COPY statement, I didn't include it because by default Snowflake uses the last location I was working in, which was the staging schema where I have those files. – ilikawa Apr 25 '21 at 12:27
  • You shouldn't be trying to use the WebUI to load large files. If your files are on a local PC then you need to use Snowsql to run a put command to load them into an internal stage; if your files are on AWS/Azure/GCP then you use an external stage. I suggest you always use a FROM clause rather than making assumptions about where Snowflake will look for the files – NickW Apr 25 '21 at 16:30
  • You can also split files using https://github.com/Mitch-Wheat/FileSplitter – Mitch Wheat Apr 27 '21 at 03:05
  • See https://stackoverflow.com/a/68718176/132438 to split files easily into smaller ones. – Felipe Hoffa Aug 09 '21 at 20:25

1 Answer


I'm not sure what went wrong in your process, but I found a similar file to try to reproduce it.

In this case, instead of uploading to a Snowflake internal stage, I told Snowflake to read it from a GCS bucket. To do so, first I created an integration:

use role accountadmin;

CREATE STORAGE INTEGRATION fh_gcp
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = GCS
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://fhoffa-snow/');

-- DESCRIBE shows the service account that needs to be
-- granted access to the bucket on the GCP side
describe integration fh_gcp;

grant usage on integration fh_gcp to role sysadmin;

use role sysadmin;

create stage fh_gcp_stage
  url = 'gcs://fhoffa-snow/'
  storage_integration = fh_gcp;

list @fh_gcp_stage; -- check the files exist

Then I slightly modified your SQL to read from this stage. Note that I didn't need to split the file; Snowflake is happy to read files larger than 1GB:

create temp table json_table_user(json_data variant);

copy into JSON_TABLE_USER
from @fh_gcp_stage
file_format = (type=json)
files=('202104/yelp_academic_dataset_user.json.gz') on_error = 'skip_file'
;

And then you can start having fun with queries and semi-structured data:

select median(json_data:average_stars) stars
    , median(json_data:review_count) reviews
    , median(json_data:funny) funny
    , median(json_data:useful) useful
    , count(*) c
from json_table_user ;

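If you'd rather work with typed columns than a single VARIANT, you can cast the fields out into a new table. A sketch (user_id and name are assumed field names from the Yelp user dataset; the others appear in the query above):

-- hypothetical typed table built from the VARIANT column
create or replace table yelp_user as
select json_data:user_id::string      as user_id   -- assumed field
     , json_data:name::string         as name      -- assumed field
     , json_data:review_count::int    as review_count
     , json_data:average_stars::float as average_stars
     , json_data:funny::int           as funny
     , json_data:useful::int          as useful
from json_table_user;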

A quick fix to what you tried above could be reading the existing file into the table you created:

copy into json_table_user
from @staging/yelp_academic_dataset_user.json.gz
file_format = (format_name = yelp_user)
on_error = 'skip_file';
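To load your four split files from that stage, also note NickW's comment: your second COPY had no FROM clause, so Snowflake looked in the table's own stage instead (hence the tables/... URL in the "file not found" error). Something along these lines should work, assuming the stage and file names from your question:

copy into json_table_user
from @staging
file_format = (format_name = 'yelp_user')
files = ('yelp_academic_dataset_user.json.001.gz',
         'yelp_academic_dataset_user.json.002.gz',
         'yelp_academic_dataset_user.json.003.gz',
         'yelp_academic_dataset_user.json.004.gz')
on_error = 'skip_file';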
Felipe Hoffa
  • Thanks Felipe, that's a really good approach even though I'm not familiar with it. – ilikawa Apr 28 '21 at 16:40
  • Also, about querying the data in the JSON tables: how do I retrieve the value of "garage" within the BusinessParking column? { "address": "921 Pearl St", "attributes": { "Alcohol": "'beer_and_wine'", "Ambience": "{'touristy': False, 'hipster': False, 'romantic': False, 'divey': False, 'intimate': False, 'trendy': False, 'upscale': False, 'classy': False, 'casual': True}", "BusinessParking": "{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}" } I have tried json_data:attributes.BusinessParking.garage but all I get are null values. – ilikawa Apr 28 '21 at 16:48
  • Happy to answer that one in a new question (for space and formatting) – Felipe Hoffa Apr 28 '21 at 17:10
  • Sorry, but it's difficult to format code within a comment and that's how it went. If you can take a look at my other question, https://stackoverflow.com/questions/67305862/snowflake-get-value-of-key-in-an-object-within-another-object, I will very much appreciate it. – ilikawa Apr 28 '21 at 18:23