
Looking for something like this:

Save Dataframe to csv directly to s3 Python

The API docs list these arguments: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html

but I'm not sure how to convert the DataFrame into a stream...

– rnd om
  • Maybe what you are looking for is [`s3fs`](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3File). – suvayu Jan 14 '23 at 02:06

1 Answer


Untested, since I don't have an AWS account.

You could use `s3fs.S3File` like this:

import polars as pl
import s3fs

# anon=False is the default, so s3fs picks up your AWS credentials
# from the usual places (environment, config files, IAM role)
fs = s3fs.S3FileSystem()
df = pl.DataFrame(
    {
        "foo": [1, 2, 3, 4, 5],
        "bar": [6, 7, 8, 9, 10],
        "ham": ["a", "b", "c", "d", "e"],
    }
)
with fs.open('my-bucket/dataframe-dump.parquet', mode='wb') as f:
    df.write_parquet(f)

Basically, s3fs gives you an fsspec-conformant file object, which Polars knows how to use because `write_parquet` accepts any regular file object or stream.
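
To make that concrete, here is a small sketch (not from the original answer) showing `write_parquet` writing to an in-memory stream instead of a file:

import io

import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3]})

# write_parquet accepts any file-like object opened in binary mode,
# so an in-memory buffer works the same way as an S3File
buf = io.BytesIO()
df.write_parquet(buf)

buf.seek(0)  # rewind before reading the bytes back
print(pl.read_parquet(buf))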

If you want to manage your S3 connection more granularly, you can construct an `S3File` object from the botocore connection (see the docs linked above).
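
Alternatively, here is an untested sketch that configures the connection through `S3FileSystem` itself rather than building an `S3File` by hand; every credential value below is a placeholder:

import polars as pl
import s3fs

# explicit credentials and client settings instead of the default lookup;
# all values here are placeholders
fs = s3fs.S3FileSystem(
    key='YOUR_ACCESS_KEY_ID',
    secret='YOUR_SECRET_ACCESS_KEY',
    client_kwargs={'region_name': 'us-east-1'},
)

df = pl.DataFrame({'foo': [1, 2, 3]})
with fs.open('my-bucket/dataframe-dump.parquet', mode='wb') as f:
    df.write_parquet(f)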

– suvayu
  • This definitely works; I'm not sure what I was thinking not opening the file first before trying `df.write_parquet(f)`. You also need to use `'wb'`. Thank you. Any reason to use s3fs instead of boto3? I haven't used this one before. – rnd om Jan 14 '23 at 02:26
  • @rndom Using something like s3fs gives you the flexibility to not care what you are passing, e.g. you could easily pass a local file for testing but run on S3 in production without changing your code (see the sketch after this thread). Achieving that with boto3 yourself might require a lot of implementation on your part. Does boto3 offer a file-like API? (I fixed the mode in the answer.) – suvayu Jan 14 '23 at 14:09
  • When you use boto3 with something like `put_object`, AWS returns a response; with s3fs I don't get anything like that, which I can figure out. It would be great to use boto3 as a context manager, but I guess that's why s3fs uses https://filesystem-spec.readthedocs.io/en/latest/ ? – rnd om Jan 14 '23 at 15:47
  • Ya, pretty much. It's an abstraction so that you can treat your blobs like files on a filesystem. Then you don't have to wait for a response like with boto3; instead you would expect `s3fs` to succeed or raise an appropriate exception (or log). You could look at the source to understand the behaviour better. – suvayu Jan 15 '23 at 13:11
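
To illustrate the point made in these comments, here is an untested sketch of swapping a local path for S3 via `fsspec.open`, with failures surfacing as exceptions rather than response objects (`s3://my-bucket/...` is a placeholder):

import fsspec
import polars as pl

df = pl.DataFrame({'foo': [1, 2, 3]})

# the same code handles a local file for testing and S3 in production;
# only the URL changes
for url in ['dataframe-dump.parquet', 's3://my-bucket/dataframe-dump.parquet']:
    try:
        with fsspec.open(url, mode='wb') as f:
            df.write_parquet(f)
    except OSError as exc:
        # unlike boto3's response objects, fsspec signals failure by raising
        # (e.g. s3fs maps AccessDenied to PermissionError, a subclass of OSError)
        print(f'write to {url} failed: {exc}')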