
I'm trying to use DuckDB in a Jupyter notebook to access and query some Parquet files held in S3, but I can't seem to get it to work. Judging from past experience, I feel like I need to assign the appropriate file system, but I'm not sure how or where to do that.

The code below raises the error: `RuntimeError: IO Error: No files found that match the pattern "s3://<bucket>/<file>.parquet"`

```python
import boto3
import duckdb

s3 = boto3.resource('s3')
client = boto3.client("s3")
con = duckdb.connect(database=':memory:', read_only=False)

con.execute("""
SET s3_region='-----';
SET s3_access_key_id='-----';
SET s3_secret_access_key='-----';
""")
out = con.execute("select * from parquet_scan('s3://<bucket>/<file>.parquet') limit 10;").fetchall()
```

I'd like to use the pandas read_sql functionality if I can, but I've posted this code to avoid adding complexity to the question.

I'm confused because this code works:

```python
import pandas as pd
import boto3

s3 = boto3.resource('s3')
client = boto3.client("s3")

df = pd.read_parquet("s3://<bucket>/<file>.parquet")
```
Ethan
  • DuckDB appears not to include its "httpfs" (used for HTTP and S3) [extension on Python](https://github.com/duckdb/duckdb/issues/2350). – Anon Coward Nov 01 '21 at 22:21
  • anyone looking for GUI tool - https://stackoverflow.com/a/74165348/6563567 – ns15 Oct 27 '22 at 09:33

3 Answers


Do this instead:

```python
con.execute("""
INSTALL httpfs;
LOAD httpfs;
SET s3_region='-----';
SET s3_access_key_id='-----';
SET s3_secret_access_key='-----';
""")
```
Cesar Ades

HTTPFS is not included in the package. However, you can build it from source; see the snippet below.

This assumes you have python3/pip3 installed on your Linux machine or container. I'm using a Photon 4 container; on other Linux distros the package names might be slightly different.

```shell
# install build dependencies
yum install build-essential cmake git openssl-devel python3-devel -y

# install python dependencies
pip3 install setuptools numpy

# clone repo with source code
git clone https://github.com/duckdb/duckdb && cd duckdb

# set a flag to build package with HTTPFS
export BUILD_HTTPFS=1
make

# once duckdb libs are built, do it again enabling python package
export BUILD_PYTHON=1
make
```

After that it should be available even without running pip install. If you're building it in a separate container, you can copy or replace this file: `tools/pythonpkg/build/lib.linux-x86_64-3.9/duckdb.cpython-39-x86_64-linux-gnu.so` to `/usr/lib/python3.9/site-packages/` on your target machine.

Mike Twc
```python
import duckdb

def query_s3_duckdb(bucket_name, object_name):
    try:
        # Connect to DuckDB and load the httpfs extension for S3 access
        conn = duckdb.connect()
        conn.execute("INSTALL httpfs")
        conn.execute("LOAD httpfs")
        conn.execute("SET s3_region='us-east-1'")
        conn.execute("SET s3_access_key_id='xxxxxxxxxxxxxxxxx'")
        conn.execute("SET s3_secret_access_key='xxxxxxxxxxxxxxxxxx'")
        conn.execute(f"CREATE TABLE data AS SELECT * FROM read_parquet('s3://{bucket_name}/{object_name}')")

        # Query data
        result = conn.execute("SELECT COUNT(*) FROM data")
        print(f"Number of rows in data: {result.fetchone()[0]}")

        # Cleanup
        conn.close()
    except duckdb.Error as e:
        print(f"Error querying data from S3: {e}")
```

This should work.

  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community May 04 '23 at 23:32