
In a Spring Boot application, I'm using Hadoop to read a Parquet file from an Amazon S3 bucket. After getting the target file as an InputStream, I want to read it. Here is my code:

var s3 = "s3a://bucketX/file.parquet";
Path s3Path = new Path(s3);

Configuration configuration = new Configuration();
configuration.set("fs.s3a.aws.credentials.profileName", "profileX"); //profileX have the permission to read the file
configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

configuration.set("fs.s3a.endpoint", "s3-eu-west-3.amazonaws.com"); 
configuration.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider"); 

var s3fs = new S3AFileSystem();
s3fs.initialize(new URI(s3), configuration);
InputStream s3InputStream = s3fs.open(new Path(s3));

Here is my pom.xml configuration:

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>3.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-hadoop</artifactId>
        <version>1.12.3</version>
    </dependency>

ParquetFileReader expects a HadoopInputFile as input. How can I convert the InputStream to a HadoopInputFile?

ParquetFileReader reader = ParquetFileReader.open(convertToHadoopInputStream(s3InputStream));
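
For reference, parquet-hadoop's HadoopInputFile is built from a Path and a Configuration rather than from an already-opened stream; a minimal sketch of that route, reusing s3Path and configuration from the snippet above:

import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;

// HadoopInputFile wraps the path and configuration; no manual stream handling needed
InputFile inputFile = HadoopInputFile.fromPath(s3Path, configuration);
ParquetFileReader reader = ParquetFileReader.open(inputFile);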
GHASSEN
  • https://stackoverflow.com/questions/59939309/read-local-parquet-file-without-hadoop-path-api this might help? – Ben Watson Mar 27 '23 at 10:49
  • Thanks. But, it is not the same case. I want to convert the inputStream to hadoop InputStream. – GHASSEN Mar 27 '23 at 11:33
  • Why do you want it as a HadoopInputStream? It looks like what you really want is to read a Parquet file using ParquetFileReader. The solution in the question I linked will let you do that without using Hadoop anywhere. – Ben Watson Mar 27 '23 at 11:44
  • @BenWatson the author of the question appears to want to read the data off s3, and to do that through parquet, yes, the hadoop FS API is needed. – stevel Mar 28 '23 at 16:25

1 Answer

  1. fs.s3a.aws.credentials.profileName is not a valid s3a option; all properties for that connector are lower-case. FWIW, there is no lower-case equivalent either.
  2. fs.s3a.impl is Stack Overflow superstition. Setting it implies you haven't looked at the Hadoop S3A docs, which are where you should start when configuring the connector, not out-of-date SO posts.
  3. No need to open the file yourself. Use the ParquetFileReader(Configuration conf, Path file, MetadataFilter filter) constructor, giving it the relevant Hadoop Configuration, the Hadoop Path, and a metadata filter (possibly NO_FILTER), and let it do the work; a sketch follows this list.
  4. And, in future, use FileSystem.get(URI, Configuration) (or Path.getFileSystem(Configuration)) to create and initialize an S3A instance; it also caches instances.
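
Putting points 3 and 4 together, a minimal sketch; the bucket, file and endpoint values are the placeholders from the question, and the AWS profile is assumed to be selected outside the s3a configuration (e.g. via the AWS_PROFILE environment variable):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;

import static org.apache.parquet.format.converter.ParquetMetadataConverter.NO_FILTER;

Configuration conf = new Configuration();
conf.set("fs.s3a.endpoint", "s3-eu-west-3.amazonaws.com");
conf.set("fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.profile.ProfileCredentialsProvider");

Path file = new Path("s3a://bucketX/file.parquet");

// the reader resolves and initializes the S3A filesystem from the path and
// configuration itself; if the filesystem is needed directly, use
// FileSystem.get(file.toUri(), conf), which also caches instances
try (ParquetFileReader reader = new ParquetFileReader(conf, file, NO_FILTER)) {
    System.out.println(reader.getFooter().getFileMetaData().getSchema());
}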
stevel