Since Modin does not support loading from multiple Parquet files on S3, I am using pyarrow to load the data.
import s3fs
import modin.pandas as pd
from pyarrow import parquet
s3 = s3fs.S3FileSystem(
    key=aws_key,
    secret=aws_secret,
)
table = parquet.ParquetDataset(
    path_or_paths="s3://bucket/path",
    filesystem=s3,
).read(
    columns=["hotelId", "startDate", "endDate"]
)
# to get a pandas df the next step would be table.to_pandas()
I now want to put the data into a Modin df for parallel computations, without having to write to and read from a CSV file. Is there a way to construct the Modin df directly from a pyarrow.Table, or at least from a pandas dataframe?