I'm trying to use ParquetFileReader to read files that I receive from S3. Since the data is not a local file and I can't create a local temp file either, I'm using a custom InputFile class.
Here is my custom class, based on this answer:
class ParquetInputFile(stream: ByteArray) : InputFile {
    var data: ByteArray = stream

    private class SeekableByteArrayInputStream(buf: ByteArray?) : ByteArrayInputStream(buf) {
        var pos: Long = -1
    }

    override fun getLength(): Long {
        return data.size.toLong()
    }

    override fun newStream(): SeekableInputStream {
        return object : DelegatingSeekableInputStream(SeekableByteArrayInputStream(data)) {
            override fun seek(newPos: Long) {
                (stream as SeekableByteArrayInputStream).pos = newPos
            }

            override fun getPos(): Long {
                return (stream as SeekableByteArrayInputStream).pos
            }
        }
    }

    override fun toString(): String {
        return "com.test.ParquetInputFile[]"
    }
}
And here is my code that gets the file from S3 and uses the class above:
val bucketFile = s3Client.getObject(
    GetObjectRequest(
        "bucket-name",
        "test_file.parquet"
    )
)
val file = ParquetInputFile(bucketFile.objectContent.readAllBytes())
val reader = ParquetFileReader.open(file)
I'm getting an exception from the .open() call when the reader tries to read the file footer: it says the file is not a Parquet file while checking the "Magic" byte array.
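A quick way to rule out a truncated or altered download could be to check the marker bytes directly on the ByteArray before handing it to the reader. The sketch below uses only the standard library; `hasParquetMagic` is a hypothetical helper name, and it assumes an unencrypted Parquet file, which starts and ends with the 4-byte ASCII marker PAR1.

```kotlin
// Hypothetical helper: true when the array starts and ends with the
// 4-byte ASCII marker "PAR1" used by unencrypted Parquet files.
fun hasParquetMagic(bytes: ByteArray): Boolean {
    val magic = "PAR1".toByteArray(Charsets.US_ASCII)
    if (bytes.size < magic.size * 2) return false
    val head = bytes.copyOfRange(0, magic.size)
    val tail = bytes.copyOfRange(bytes.size - magic.size, bytes.size)
    return head.contentEquals(magic) && tail.contentEquals(magic)
}

fun main() {
    val magic = "PAR1".toByteArray(Charsets.US_ASCII)
    // Stand-in data, not a real Parquet file: marker, filler, marker.
    val wellFormed = magic + byteArrayOf(0, 0, 0, 0) + magic
    // Same data with the last two bytes cut off, as a truncated download would be.
    val truncated = wellFormed.copyOfRange(0, wellFormed.size - 2)
    println(hasParquetMagic(wellFormed))
    println(hasParquetMagic(truncated))
}
```

If this check passes on the bytes read from S3, the content itself is probably intact and the problem is more likely in how the stream is positioned.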
I ran a quick test with the same S3 file, but read it from the local disk using a deprecated method of ParquetFileReader, and it works:
val local = ParquetFileReader.open(Configuration(), Path("/Users/casky/Documents/pocs/resources/test_file.parquet"))
Debugging the same readFooter method, I saw that the fileMetadataLength it reads differs between the local file and the S3 file. When readIntLittleEndian() executes its four read calls, the values returned in ch1, ch2, ch3 and ch4 are:
Local File: 242, 7, 0, 0 returning 2034
S3 File: 80, 65, 82, 49 returning 827474256
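But, as you can see, the S3 values in ch1, ch2, ch3 and ch4 are exactly the bytes that the Parquet MAGIC array expects (the ASCII codes for PAR1), so they look like the magic marker being read where the footer length should be.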
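The arithmetic behind both results can be reproduced with a small standalone sketch. The helper name `littleEndianInt` is hypothetical; it only mirrors what a little-endian read of four bytes does and is not taken from the Parquet source.

```kotlin
// Hypothetical helper: combines four byte values (0..255) into a
// little-endian 32-bit integer, least significant byte first.
fun littleEndianInt(ch1: Int, ch2: Int, ch3: Int, ch4: Int): Int =
    (ch4 shl 24) or (ch3 shl 16) or (ch2 shl 8) or ch1

fun main() {
    // Bytes observed for the local file.
    println(littleEndianInt(242, 7, 0, 0))      // 2034
    // Bytes observed for the S3 file.
    println(littleEndianInt(80, 65, 82, 49))    // 827474256
    // The same four S3 bytes decoded as ASCII text.
    println(byteArrayOf(80, 65, 82, 49).toString(Charsets.US_ASCII)) // PAR1
}
```

So 827474256 is simply the text PAR1 interpreted as a number, not a plausible metadata length.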
Now I'm not sure whether the custom class is messing it up somehow, or whether the Path object does something with the file content while reading it from the local disk.
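One possibility worth checking, stated here as an assumption rather than a confirmed diagnosis: the `pos` property declared in SeekableByteArrayInputStream may be a separate Kotlin property and not the read cursor that ByteArrayInputStream keeps internally. In that case seek() would never move the actual read position, reads would start at byte 0, and the first four bytes returned would be the leading PAR1 marker. The sketch below uses only the standard library and hypothetical names (`PositionedByteArrayInputStream`, `seekTo`, `position`) to show a stream whose seek moves the inherited cursor; the sample bytes are a stand-in, not a real Parquet file.

```kotlin
import java.io.ByteArrayInputStream

// Hypothetical class for illustration: it exposes the cursor inherited from
// ByteArrayInputStream instead of declaring a separate property.
class PositionedByteArrayInputStream(data: ByteArray) : ByteArrayInputStream(data) {
    // Moves the inherited protected `pos` field, so the next read() starts there.
    fun seekTo(newPos: Long) {
        require(newPos in 0L..count.toLong()) { "position out of range: $newPos" }
        pos = newPos.toInt()
    }

    // Reports the inherited cursor, which advances as bytes are read.
    fun position(): Long = pos.toLong()
}

fun main() {
    val magic = "PAR1".toByteArray(Charsets.US_ASCII)
    // Stand-in layout: marker, 4 filler bytes, 4-byte little-endian length, marker.
    val bytes = magic + byteArrayOf(1, 2, 3, 4) + byteArrayOf(242.toByte(), 7, 0, 0) + magic

    val stream = PositionedByteArrayInputStream(bytes)
    // Jump to 8 bytes before the end, where the length field of this sample sits.
    stream.seekTo(bytes.size - 8L)
    val ch1 = stream.read()
    val ch2 = stream.read()
    val ch3 = stream.read()
    val ch4 = stream.read()
    println((ch4 shl 24) or (ch3 shl 16) or (ch2 shl 8) or ch1) // 2034
    println(stream.position()) // 12
}
```

With a stream like this, the seek and getPos overrides of the DelegatingSeekableInputStream wrapper could delegate to `seekTo` and `position`, so that the wrapper and the underlying stream agree on where the next read happens.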