I have Parquet files in an S3 bucket that is not hosted on AWS S3.

Is there a tool that can connect to any S3-compatible service (such as Wasabi, DigitalOcean, or MinIO) and let me query the Parquet files?

  • Yes, the easiest built-in tool AWS provides for this is Athena, but there are many "warehouse" services that can ingest data from S3 in Parquet format. – Tom Slabbaert Oct 20 '22 at 08:21
  • @TomSlabbaert note that my data is not stored in AWS S3. Does Athena allow reading data from an external source? – Sasha Chernin Oct 20 '22 at 08:54
  • Yes, Athena allows integration with external sources, though I'm not sure how simple the process will be for you. – Tom Slabbaert Oct 20 '22 at 09:05

2 Answers

If you need a GUI tool, you can use DBeaver with DuckDB. For programmatic use, DuckDB has client libraries for most languages.

Here is my other answer on the same topic.

There is a slight difference since you are querying data on S3-compatible storage: you need to run a few additional commands, described in the DuckDB docs.

-- Execute these commands one by one, or run them as a script from DBeaver
INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-east-1';
SET s3_access_key_id='****';
SET s3_secret_access_key='****';

-- Read the data using S3 protocol
select * from read_parquet('s3://test-me-ses/userdata1.parquet');

If your Parquet files are served over HTTP(S), whether from S3 or from any web server, DuckDB can read them as well.

-- Read the data over HTTP(S) if the Parquet file is hosted on a web server
SELECT * FROM read_parquet('https://test-me-ses.s3.amazonaws.com/userdata1.parquet');

Any S3-compatible object store (Wasabi, DigitalOcean, MinIO) should work similarly.
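For a provider other than AWS, DuckDB's httpfs extension usually also has to be told where the service lives, through the `s3_endpoint` setting (and, for servers such as MinIO, `s3_url_style='path'`). Below is a minimal Python sketch of that setup, assuming the `duckdb` package is installed. The helper names `s3_settings` and `query_parquet`, the endpoint `localhost:9000`, and the bucket name are hypothetical placeholders, not values from this thread.

```python
def s3_settings(endpoint, region, access_key, secret_key,
                url_style="path", use_ssl=True):
    """Return the DuckDB statements that point httpfs at an S3-compatible endpoint."""
    def q(value):
        # Double any single quotes so the value is a valid SQL string literal.
        return "'" + str(value).replace("'", "''") + "'"

    return [
        "INSTALL httpfs",
        "LOAD httpfs",
        f"SET s3_endpoint={q(endpoint)}",        # host[:port], without the scheme
        f"SET s3_region={q(region)}",
        f"SET s3_access_key_id={q(access_key)}",
        f"SET s3_secret_access_key={q(secret_key)}",
        f"SET s3_url_style={q(url_style)}",      # 'path' or 'vhost'
        f"SET s3_use_ssl={'true' if use_ssl else 'false'}",
    ]


def query_parquet(sql, settings):
    """Apply the settings on a fresh in-memory connection, then run the query."""
    import duckdb  # imported lazily; assumes `pip install duckdb`

    con = duckdb.connect()
    for statement in settings:
        con.execute(statement)
    return con.execute(sql).fetchall()


# Example call (placeholder endpoint and bucket; needs network access and real keys):
# rows = query_parquet(
#     "SELECT * FROM read_parquet('s3://my-bucket/userdata1.parquet') LIMIT 5",
#     s3_settings("localhost:9000", "us-east-1", "****", "****", use_ssl=False),
# )
```

The right endpoint host and URL style depend on the provider, so check its documentation before relying on the defaults used here.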

After transforming the data, you can also write it back as Parquet to any S3-compatible storage (AWS, MinIO, etc.).

All of these can be done programmatically as well.
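As a rough illustration of the write-back step done programmatically, the sketch below wraps a query in DuckDB's `COPY ... TO ... (FORMAT PARQUET)` statement. The helper names `copy_to_parquet_sql` and `write_parquet` and the bucket path are hypothetical, and the connection is assumed to already have httpfs loaded and the S3 credentials set.

```python
def copy_to_parquet_sql(select_sql, target_url):
    """Build a DuckDB COPY statement that writes a query result as Parquet."""
    # Drop a trailing semicolon so the query can be embedded in parentheses.
    query = select_sql.strip().rstrip(";").strip()
    # Double any single quotes so the target is a valid SQL string literal.
    target = target_url.replace("'", "''")
    return f"COPY ({query}) TO '{target}' (FORMAT PARQUET)"


def write_parquet(con, select_sql, target_url):
    """Run the COPY on an existing connection (e.g. from duckdb.connect())."""
    statement = copy_to_parquet_sql(select_sql, target_url)
    con.execute(statement)
    return statement


# Example call (placeholder bucket; `con` must already be configured for S3):
# write_parquet(
#     con,
#     "SELECT * FROM read_parquet('s3://my-bucket/userdata1.parquet') WHERE id < 100",
#     "s3://my-bucket/filtered.parquet",
# )
```

Keeping the statement builder separate from the execution step makes it easy to log or inspect the SQL before anything is written to the bucket.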

ns15

With MongoDB, this can be done with our Atlas Data Federation product: https://www.mongodb.com/docs/atlas/data-federation/overview/

It can query parquet files stored in S3.

Joe Drumgoole
  • Thanks, but I see that it references AWS S3 only, not other S3 services like Wasabi or Digital Ocean. Is it possible to query parquet files that are hosted on other S3 services? – Sasha Chernin Oct 20 '22 at 14:31
  • You could use our HTTP interface https://www.mongodb.com/docs/atlas/data-federation/config/config-http-endpoint/ and point it at the other buckets. – Joe Drumgoole Oct 28 '22 at 00:03