I want to generate the TPC-DS data (1 TB and 10 TB) directly in AWS S3, without first generating it on a local machine and transferring it to S3. What is the easiest way to do that?
1 Answer
I did similar work several months ago; hive-testbench can be an option. Check its README.md for how to set it up.

You need to set fs.defaultFS in $HADOOP_HOME/etc/hadoop/core-site.xml to your AWS S3 bucket, so that the data is generated directly in S3.

Pass the data scale parameter to ./tpcds-setup.sh to generate data at different scales.
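To make the two steps concrete, here is a minimal Python sketch that only prints the property to add to core-site.xml and the setup commands to run; it does not touch Hadoop or S3 itself. The bucket name my-tpcds-bucket is hypothetical, and the s3a:// scheme is an assumption that fits a stock Hadoop install with the S3A connector and credentials already configured (on EMR the scheme is usually s3://). As far as I recall from the hive-testbench README, the scale factor is roughly in gigabytes, so 1000 and 10000 should correspond to 1 TB and 10 TB; please verify that against the README before starting a long run.

```python
import xml.etree.ElementTree as ET


def default_fs_property(bucket, scheme="s3a"):
    """Build the <property> element that goes inside <configuration> in core-site.xml.

    Both the bucket name and the URI scheme are assumptions; adjust them to your cluster.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    ET.SubElement(prop, "value").text = f"{scheme}://{bucket}"
    return ET.tostring(prop, encoding="unicode")


def setup_command(scale_gb):
    """Return the hive-testbench invocation for a given scale factor (roughly GB)."""
    return ["./tpcds-setup.sh", str(scale_gb)]


if __name__ == "__main__":
    # Hypothetical bucket name, used for illustration only.
    print(default_fs_property("my-tpcds-bucket"))
    # 1000 is about 1 TB, 10000 is about 10 TB (assumed from the README).
    for scale in (1000, 10000):
        print(" ".join(setup_command(scale)))
```

Paste the printed property into the configuration element of core-site.xml, then run the printed commands from the hive-testbench directory.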

Eugene