I'm trying to use DuckDB in a Jupyter notebook to query some Parquet files stored in S3, but I can't get it to work. Judging from past experience, I suspect I need to assign the appropriate file system, but I'm not sure how or where to do that.
The code below raises this error: RuntimeError: IO Error: No files found that match the pattern "s3://<bucket>/<file>.parquet"
import boto3
import duckdb
s3 = boto3.resource('s3')
client = boto3.client("s3")
con = duckdb.connect(database=':memory:', read_only=False)
con.execute("""
SET s3_region='-----';
SET s3_access_key_id='-----';
SET s3_secret_access_key='-----';
""")
out = con.execute("select * from parquet_scan('s3://<bucket>/<file>.parquet') limit 10;").fetchall()
I'd like to use the pandas read_sql functionality if I can, but I've posted this simpler code to avoid adding complexity to the question.
I'm confused because this code works:
import pandas as pd
import boto3
s3 = boto3.resource('s3')
client = boto3.client("s3")
df = pd.read_parquet("s3://<bucket>/<file>.parquet")
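One difference that may matter: pandas reads s3:// paths through its own filesystem layer, while DuckDB handles them through its httpfs extension, which has to be installed and loaded on the connection before S3 paths resolve. Below is a minimal sketch of what I think the setup would look like, assuming the missing extension is the cause of the error. The helper name configure_s3 and its parameters are hypothetical, and the commented usage keeps the same placeholders as above.

```python
def _quote(value):
    """Wrap a value in single quotes for SQL, doubling any embedded quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def configure_s3(con, region, access_key_id, secret_access_key):
    """Load httpfs and set S3 credentials on an open DuckDB connection.

    `con` is anything with an execute(sql) method, e.g. duckdb.connect().
    Returns the statements that were run, which is handy for debugging.
    """
    statements = [
        "INSTALL httpfs;",  # downloads the extension the first time
        "LOAD httpfs;",     # must be loaded on every new connection
        f"SET s3_region={_quote(region)};",
        f"SET s3_access_key_id={_quote(access_key_id)};",
        f"SET s3_secret_access_key={_quote(secret_access_key)};",
    ]
    for statement in statements:
        con.execute(statement)
    return statements


# Usage (needs duckdb installed and network access, so left commented out):
# import duckdb
# con = duckdb.connect(database=':memory:')
# configure_s3(con, '-----', '-----', '-----')
# df = con.execute(
#     "select * from parquet_scan('s3://<bucket>/<file>.parquet') limit 10;"
# ).df()  # .df() returns a pandas DataFrame instead of a list of tuples
```

Note that the boto3 objects are not used by DuckDB at all in this sketch; the credentials are passed explicitly through the SET statements.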