I'm trying to read a Parquet file from S3, but every method I try raises an error. I can read any CSV file from S3 using CSV.read, and I can read a Parquet file when it is stored locally, but I cannot read one from S3.

using AWS, AWSS3, CSV, DataFrames, Parquet
aws = global_aws_config()
p = S3Path("s3://mybucket/")
file = joinpath(p, "test.parquet")
# Via an S3Path
DataFrame(read_parquet(file)) # ERROR: MethodError: no method matching Parquet.File(::S3Path{Nothing})
# Directly, via s3_get
DataFrame(read_parquet(s3_get(aws, "mybucket", "test.parquet"))) # ERROR: MethodError: no method matching Parquet.File(::S3Path{Nothing})
awoj
  • It seems like [Parquet2.jl](https://expandingman.gitlab.io/Parquet2.jl/) can do this, and this [FAQ answer](https://expandingman.gitlab.io/Parquet2.jl/faq/#Why-start-from-scratch-instead-of-improving-Parquet.jl?) seems to imply that [Parquet.jl](https://github.com/JuliaIO/Parquet.jl) (the one you're using here) doesn't have this feature. – Sundar R Aug 25 '23 at 16:35
  • Thank you Sundar for mentioning Parquet2, in fact it works! – awoj Aug 28 '23 at 05:07

2 Answers

As @Sundar R mentioned, Parquet2 solves that issue:

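# AWSS3 is loaded here on the assumption that it is needed for s3:// paths to resolve
using AWSS3, DataFrames, Parquet2
ds = Parquet2.Dataset("s3://mybucket/test.parquet") |> DataFrame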
awoj

Here is an alternative answer, since I had a similar problem with JLD2 files. (See my question from some time ago: load-julia-jld-2-file-from-aws-s3.)

When you read a file from an S3 bucket, you get a byte vector back. One option is to write those bytes to a local file and then load that local copy.

using AWS
using AWSS3
using DataFrames
using Parquet

aws = global_aws_config()
p = S3Path("s3://my/path/to/", config=aws)

# Download the object as a byte vector and write it to a local file
byte_vector = read(joinpath(p, "test.parquet"))
local_file = joinpath("to", "my", "test.parquet")
write(local_file, byte_vector)

# Load the local copy with Parquet.jl
my_object = read_parquet(local_file) |> DataFrame
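
If you don't want to keep the local copy, the write-then-load idea can be wrapped in a small helper that uses a temporary directory. This is only a sketch: `read_via_tempfile`, its `suffix` keyword, and the file name `download` are hypothetical names made up for illustration, not part of AWSS3 or Parquet.jl. The `reader` function should fully materialize its result (for example by building a DataFrame), because the temporary file is deleted as soon as the helper returns.

```julia
# Hypothetical helper (not part of any package): write `bytes` to a file in a
# temporary directory, call `reader(path)` on it, and return the result.
# The directory and the file are removed when the `do` block finishes.
function read_via_tempfile(reader, bytes::Vector{UInt8}; suffix::AbstractString=".parquet")
    mktempdir() do dir
        path = joinpath(dir, "download" * suffix)
        write(path, bytes)
        reader(path)
    end
end

# Local demonstration with plain text, so it runs without S3 access
text = read_via_tempfile(path -> read(path, String), Vector{UInt8}("hello"); suffix=".txt")

# Assumed usage with the packages from this answer (requires AWS credentials):
# byte_vector = read(joinpath(p, "test.parquet"))
# df = read_via_tempfile(byte_vector) do path
#     DataFrame(read_parquet(path))
# end
```

Passing the reader as the first argument allows the `do` block form shown in the commented usage, and the same helper should work for other formats that can only be loaded from a local path.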
Georgery