5

Since Modin does not support loading from multiple pyarrow files on s3, I am using pyarrow to load the data.


    import s3fs
    import modin.pandas as pd
    from pyarrow import parquet
    
    s3 = s3fs.S3FileSystem(
        key=aws_key,
        secret=aws_secret
    )

    table = parquet.ParquetDataset(
        path_or_paths="s3://bucket/path", 
        filesystem=s3,
    ).read(
        columns=["hotelId", "startDate", "endDate"]
    )

    # to get a pandas df the next step would be table.to_pandas()

If I know want to put the data in a Modin df for parallel computations without having to write to and read from a csv? Is there a way to construct the Modin df directly from a pyarrow.Table or at least from a pandas dataframe?

galinden
  • 610
  • 8
  • 13

2 Answers2

1

Mahesh's answer should work but I believe it would result in a full data copy (2X memory footprint by default: https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy)

At the time of writing Modin does have a native arrow integration, so you can directly convert using

from modin.pandas.utils import from_arrow

mdf = from_arrow(pyarrow_table)
0

You can't construct the Modin dataframe directly out of a pyarrow.Table, because pandas doesn't support that, and Modin only supports a subset of the pandas API. However, the table has a method that converts it to a pandas dataframe, and you can construct the Modin dataframe out of that. Using table from your code:

import modin.pandas as pd
modin_dataframe = pd.Dataframe(table.to_pandas())