Pyarrow read/write from s3

Question

Is it possible to read Parquet files from one folder in S3 and write them to another folder using pyarrow, without converting to pandas along the way?

Here is my code:

import pyarrow.parquet as pq
import pyarrow as pa
import s3fs

s3 = s3fs.S3FileSystem()

bucket = 'demo-s3'

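# Current approach: round trip through a pandas DataFrame.
# (use_threads replaces the older nthreads argument in later pyarrow releases.)
df = pq.ParquetDataset('s3://{0}/old'.format(bucket), filesystem=s3).read(use_threads=True).to_pandas()
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, 's3://{0}/new'.format(bucket), filesystem=s3, use_dictionary=True, compression='snappy')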

Is there any reason not to use s3fs to copy the files? – mdurant Jun 26 '18 at 16:51

score 9 · Answer 1 · answered Jun 26 '18 at 16:56

If you do not wish to copy the files directly, it appears you can indeed avoid pandas thus:

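# Read straight into an Arrow table; no pandas conversion is needed.
# (use_threads replaces the older nthreads argument in later pyarrow releases.)
table = pq.ParquetDataset('s3://{0}/old'.format(bucket),
    filesystem=s3).read(use_threads=True)
pq.write_to_dataset(table, 's3://{0}/new'.format(bucket),
    filesystem=s3, use_dictionary=True, compression='snappy')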

score 0 · Answer 2 · Igor Tavares · 2020-04-18T14:26:43.380

Why not just copy directly (S3 -> S3) and save memory and I/O?

import awswrangler as wr

SOURCE_PATH = "s3://..."
TARGET_PATH = "s3://..."

wr.s3.copy_objects(
    source_path=SOURCE_PATH,
    target_path=TARGET_PATH
)
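
If adding awswrangler is not desirable, the comment under the question suggests that s3fs can do the copy itself. A rough sketch under that assumption: list the objects under the source prefix with fs.find and copy each one with fs.copy, both of which belong to the fsspec interface that s3fs implements. The helpers target_key and copy_prefix are hypothetical names introduced here, and the bucket and folder names are taken from the question.

```python
def _norm(prefix):
    """Drop an optional s3:// scheme and make sure the prefix ends with '/'."""
    if prefix.startswith("s3://"):
        prefix = prefix[len("s3://"):]
    return prefix.rstrip("/") + "/"


def target_key(key, src_prefix, dst_prefix):
    """Rewrite `key` so it lives under `dst_prefix` instead of `src_prefix`.

    `key` is expected without the s3:// scheme, which is how s3fs
    usually reports object paths."""
    src, dst = _norm(src_prefix), _norm(dst_prefix)
    if not key.startswith(src):
        raise ValueError(f"{key!r} is not under {src!r}")
    return dst + key[len(src):]


def copy_prefix(fs, src_prefix, dst_prefix):
    """Copy every object found under `src_prefix` to the matching path
    under `dst_prefix`, using the filesystem's own find() and copy()."""
    copied = []
    for key in fs.find(src_prefix):
        dest = target_key(key, src_prefix, dst_prefix)
        fs.copy(key, dest)
        copied.append(dest)
    return copied


# Intended use (needs s3fs and valid AWS credentials, so it is not run here):
#   import s3fs
#   fs = s3fs.S3FileSystem()
#   copy_prefix(fs, "demo-s3/old", "demo-s3/new")
```

With s3fs the copy is expected to happen on the S3 side, so the Parquet bytes should not pass through the local machine, but this is worth verifying for the s3fs version in use.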

edited Apr 18 '20 at 14:26

answered Jan 10 '20 at 12:41

Igor Tavares
