I'm copying a large volume of data from S3 into Redshift. The data is stored in a series of hierarchical folders; at the bottom level of the folders (depth n=3) there are files without an extension whose contents are formatted as JSON. Another note: the data is not partitioned.
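To illustrate the layout, the keys look roughly like this (the bucket and folder names here are placeholders, not the real ones):

s3://my-bucket/level-1/level-2/level-3/datafile-0001
s3://my-bucket/level-1/level-2/level-3/datafile-0002
s3://my-bucket/level-1/level-2/level-3/datafile-0003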
Previously this data was being crawled/read by Athena, but that was very slow and ran into a lot of timeouts (the source table is roughly 2 TB, with hundreds of billions to trillions of rows).
I am now attempting to use a COPY statement to load a several-hundred-gigabyte subset of the data from S3 into Redshift. The table I created to copy it into:
CREATE TABLE data_raw (
json_raw SUPER
)
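The COPY command is roughly of this form (the bucket path and IAM role below are placeholders for the real values; I'm using the noshred option so each JSON document lands in the single SUPER column):

COPY data_raw
FROM 's3://my-bucket/subset-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT JSON 'noshred';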
When running the COPY statement, I get the following error after about 45 minutes:
ERROR: S3CurlException: Connection timed out after 50001 milliseconds, CurlError 28, multiCurlError 0, CanRetry 1, UserError 0 Detail:
-----------------------------------------------
error: S3CurlException: Connection timed out after 50001 milliseconds, CurlError 28, multiCurlError 0, CanRetry 1, UserError 0 code: 9002 [...] query: #### location: s3_utility.cpp:665 process: padbmaster [pid=200##]
I have read this question regarding COPY statements in Redshift, looked into this question about curl timeouts, and read this GitHub issue regarding some of the timeout properties. Granted, the linked issue seems to be for a different part of the AWS tooling than what I am using.
What I want to know:
- Is this something that can be resolved via Redshift or S3 configuration? (I get the impression that it cannot be.)
- What workaround could be used? For example, would splitting the load into smaller per-prefix COPY commands, as sketched after this list, be a reasonable approach?
- Is part of the issue that the table I used has just a single SUPER column?
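The per-prefix splitting I have in mind would look roughly like this (the prefix names are placeholders for the actual top-level folders):

COPY data_raw FROM 's3://my-bucket/prefix-000/' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole' FORMAT JSON 'noshred';
COPY data_raw FROM 's3://my-bucket/prefix-001/' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole' FORMAT JSON 'noshred';
-- ...one COPY per top-level prefix, so each individual run covers a much smaller slice of the data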
Other things to know:
- I already validated the JSON; as a further check, a NOLOAD COPY against a single prefix (sketched after this list) could confirm that Redshift itself can parse the files
- The size of the data subset I am trying to copy is above 200 GB (I'm not sure of the exact size, since the calculation failed at some point past that number)
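A NOLOAD validation run would be along these lines (again, the path and role are placeholders; NOLOAD checks the files without actually loading them):

COPY data_raw
FROM 's3://my-bucket/prefix-000/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT JSON 'noshred'
NOLOAD;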