
I am trying to move a CSV file from a GCS bucket to an AWS S3 bucket.

Considerations:

  • The CSV file is dynamically generated, so the schema is unknown.
  • The filename should stay the same once the file is transferred to the S3 bucket.

In both cases Cloud Data Fusion (CDF) fails. When I provide a schema with the column name body and type bytes, it fails with the exception 'Illegal base64 character 5f'.


When a specific schema is given, the file name is changed to part-* once the file lands in the S3 bucket. This should be a simple task for Data Fusion. Is there any way to achieve it?

– RaptorX

2 Answers


The illegal base64 error might be caused by the encoding of the input CSV file: when a column is declared as type bytes, its contents are apparently expected to be base64-encoded, and 0x5f (an underscore) is not a valid character in standard base64.
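
As a rough illustration only (Python here, and the value order_id is just a placeholder for whatever plain text your CSV contains), a strict base64 decoder rejects text containing an underscore, which matches the 5f in the error:

import base64
import binascii

# A plain-text value with an underscore (ASCII 0x5f), such as a CSV
# header like "order_id", is not valid standard base64.
try:
    base64.b64decode(b"order_id", validate=True)
except binascii.Error as exc:
    print("decode failed:", exc)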

As for the second issue, files written to the AWS S3 bucket are auto-named by Hadoop in the format 'part-r-<taskNumber>'. You can try overriding this by setting the File System Properties of the plugin to include:

{
  "mapreduce.output.basename": "<your-prefix>"
}

See the documentation here.

– vanathi-g
  • Thanks vanathi. The file system properties resulted in a file with the format -r-part. I checked the documentation before, but it does not provide any list of available parameters or details that can be used. When moving objects between buckets, the object name needs to stay the same. Even if I copy and rename the object on the S3 side, it creates multiple copies, and Data Fusion does not provide a plugin to delete objects in S3; only complete buckets can be deleted. Is there any way to keep the name the same while copying to S3 itself? – RaptorX Jun 12 '23 at 08:02

I referred to "How to write a file or data to an S3 object using boto3" and implemented the transfer with a Cloud Function instead. CDF does not provide functionality for an effective S3 transfer from GCS.
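
For reference, a minimal sketch of that approach: a Cloud Function on the GCS "finalize" trigger that copies the new object to S3 under the same key. The destination bucket name, the function name, and the credential handling (AWS keys from environment variables) are placeholders, not details from the original pipeline.

import boto3
from google.cloud import storage

S3_BUCKET = "my-target-bucket"  # placeholder destination bucket


def copy_gcs_object_to_s3(event, context):
    """Cloud Function (GCS 'finalize' trigger): copy the new object to S3
    under the same key so the file name is preserved."""
    bucket_name = event["bucket"]   # source GCS bucket from the trigger event
    object_name = event["name"]     # e.g. "report.csv" (name kept as-is)

    # Download the object from GCS into memory (fine for small CSVs;
    # stream to a temp file for large ones).
    blob = storage.Client().bucket(bucket_name).blob(object_name)
    data = blob.download_as_bytes()

    # Upload to S3 with the same key. boto3 reads credentials from the
    # environment (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
    boto3.client("s3").put_object(Bucket=S3_BUCKET, Key=object_name, Body=data)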

– RaptorX