
I'm trying to use ParquetFileReader to read files I'm receiving from S3 using a custom InputFile class, since it's not a local file and I can't create a local temp file either.

Here is my custom class, based on this answer:

import java.io.ByteArrayInputStream
import org.apache.parquet.io.DelegatingSeekableInputStream
import org.apache.parquet.io.InputFile
import org.apache.parquet.io.SeekableInputStream

class ParquetInputFile(stream: ByteArray) : InputFile {

    var data: ByteArray = stream

    private class SeekableByteArrayInputStream(buf: ByteArray?) : ByteArrayInputStream(buf) {
        var pos: Long = -1
    }

    override fun getLength(): Long {
        return data.size.toLong()
    }

    override fun newStream(): SeekableInputStream {
        return object : DelegatingSeekableInputStream(SeekableByteArrayInputStream(data)) {

            override fun seek(newPos: Long) {
                (stream as SeekableByteArrayInputStream).pos = newPos
            }

            override fun getPos(): Long {
                return (stream as SeekableByteArrayInputStream).pos
            }
        }

    }

    override fun toString(): String {
        return "com.test.ParquetInputFile[]"
    }
}
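
For contrast, here is a minimal, standalone sketch (the class name is illustrative, not part of the Parquet API) of a byte-array stream whose seek() actually moves the read cursor, by writing to the protected `pos` field that java.io.ByteArrayInputStream itself uses as its cursor:

```kotlin
import java.io.ByteArrayInputStream

// Illustrative sketch: a seek() that repositions the real cursor, instead of
// only recording the target offset in a separate field.
class TrulySeekableByteArrayInputStream(buf: ByteArray) : ByteArrayInputStream(buf) {
    fun seek(newPos: Long) {
        // ByteArrayInputStream exposes its read cursor as the protected field `pos`
        pos = newPos.toInt()
    }

    fun position(): Long = pos.toLong()
}

fun main() {
    val data = byteArrayOf(10, 20, 30, 40)
    val s = TrulySeekableByteArrayInputStream(data)
    s.seek(2)
    println(s.read())      // reads the byte at offset 2 -> 30
    println(s.position())  // cursor advanced past it -> 3
}
```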

And here is my code fetching the file from S3 and using the class above:

val bucketFile = s3Client.getObject(
    GetObjectRequest(
        "bucket-name",
        "test_file.parquet"
    )
)

val file = ParquetInputFile(bucketFile.objectContent.readAllBytes())
val reader = ParquetFileReader.open(file)

I'm getting an exception from the .open() call when the reader tries to read the file footer: it says the file is not a Parquet file while checking the "magic" byte array.

I ran a quick test with the same S3 file, but reading it from the local disk using a deprecated method of ParquetFileReader, and it works:

val local = ParquetFileReader.open(Configuration(), Path("/Users/casky/Documents/pocs/resources/test_file.parquet"))

Debugging the same readFooter method, I saw that when it reads the fileMetadataLength here, the size differs between the local file and the S3 file. In fact, when readIntLittleEndian() executes its four read calls here, the results returned in ch1, ch2, ch3 and ch4 are:

Local File: 242, 7, 0, 0 returning 2034

S3 File: 80, 65, 82, 49 returning 827474256

But, as you can see, the values of ch1, ch2, ch3 and ch4 from the S3 file are exactly the bytes that the Parquet MAGIC array expects ("PAR1").
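
To make the two numbers concrete: the four bytes read from the S3-backed stream are the ASCII codes of "PAR1", and combining them little-endian (first byte least significant, which is what a readIntLittleEndian-style helper does) yields exactly 827474256, while the local file's bytes yield 2034:

```kotlin
fun main() {
    val s3Bytes = intArrayOf(80, 65, 82, 49)
    // The bytes spell out the Parquet magic string
    println(s3Bytes.map { it.toChar() }.joinToString(""))  // PAR1

    // Little-endian: ch4<<24 | ch3<<16 | ch2<<8 | ch1
    fun littleEndian(ch: IntArray) =
        (ch[3] shl 24) or (ch[2] shl 16) or (ch[1] shl 8) or ch[0]

    println(littleEndian(s3Bytes))                   // 827474256
    println(littleEndian(intArrayOf(242, 7, 0, 0)))  // 2034, the local footer length
}
```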

Now I'm not sure whether the custom class is messing it up somehow, or whether the Path object does something with the file content while reading it from the local disk.
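
One way to narrow it down (this is my guess at the cause, not confirmed against the Parquet source): if seek() only records the target offset in a side field and never moves the real ByteArrayInputStream cursor, every read still starts at offset 0, where the header magic "PAR1" sits. A minimal repro of that behavior, with illustrative names:

```kotlin
import java.io.ByteArrayInputStream

// Repro sketch: seek() stores the offset in a side field, like the `pos`
// field in my class above, but the actual read cursor never moves.
class PosFieldOnlyStream(buf: ByteArray) : ByteArrayInputStream(buf) {
    var fakePos: Long = -1
    fun seek(newPos: Long) { fakePos = newPos }  // cursor untouched
}

fun main() {
    // "PAR1" header followed by fake payload; a real footer would be at the end
    val data = "PAR1-payload-footer".toByteArray()
    val s = PosFieldOnlyStream(data)
    s.seek((data.size - 6).toLong())  // try to jump near the footer

    val four = ByteArray(4)
    s.read(four)                      // still reads from offset 0
    println(four.map { it.toInt() })  // [80, 65, 82, 49]
    println(String(four))             // PAR1 -- the same values I saw in ch1..ch4
}
```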

  • As it is working fine locally, can you please confirm whether you performed the following steps: 1. Check whether the S3 bucket path has corrupted files. 2. Create a new path in S3 and try again. – Shash Jan 24 '22 at 20:25
  • If you are trying to read multiple files in a bucket to perform analysis, there are likely better tools than downloading each file and parsing their binary manually. For example, Flink or Spark on the whole S3 path. Or using s3-select, or Presto for SQL. – OneCricketeer Jan 24 '22 at 21:45
