Apache Beam: reading multiple files with beam.dataframe.io.read_csv returns _ReadFromPandas objects instead of dataframes

Question

I'm trying to read multiple CSV files. Why does this code return <_ReadFromPandas(PTransform) label=[_ReadFromPandas]> objects instead of DataFrames? Here's the read_csv source: https://beam.apache.org/releases/pydoc/2.25.0/_modules/apache_beam/dataframe/io.html

pcol_of_dfs = (p 
    | 'Match files' >> beam.io.fileio.MatchFiles(path)
    | 'Read Files' >> beam.Map(lambda file_meta: beam.dataframe.io.read_csv(file_meta.path))
)

Ultimately I want to read all the CSV files and add each file's name as an additional column.

I have several hundred gzipped CSV files in a GCS bucket. All of them have an identical set of columns, and all have headers. The CSV values may contain line breaks. Files vary in size from a few KB to ~5 GB.

score -1 · Answer 1 · answered May 23 '23 at 23:12

Per this post, to read multiple CSV files into a single pandas DataFrame, you can read each file with pandas.read_csv() and combine the results with the pandas.concat() function instead.

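import glob
import os

import pandas as pd

path = r'your_path_here'  # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))

# Read each file into its own DataFrame, treating the first row as the header.
frames = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    frames.append(df)

# Stack the per-file frames row-wise and renumber the index from 0.
frame = pd.concat(frames, axis=0, ignore_index=True)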
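
Since the question also asks for the file name as an extra column and the files are gzipped, the same pandas approach could be extended roughly as follows. This is only a sketch under a few assumptions: the helper name read_csvs_with_filename and the column name source_file are made up for illustration, the files are reachable through a local path (glob does not list a GCS bucket), and everything fits in memory on one machine, which may not hold for files of several GB.

```python
import glob
import os

import pandas as pd


def read_csvs_with_filename(pattern, filename_column="source_file"):
    """Read every CSV matching `pattern` and tag each row with its file name.

    `read_csvs_with_filename` and `source_file` are illustrative names,
    not part of pandas or Beam.
    """
    frames = []
    for filename in sorted(glob.glob(pattern)):
        # compression="infer" is the pandas default; it selects gzip
        # when the path ends in ".gz". Quoted fields that contain line
        # breaks are handled by the CSV parser itself.
        df = pd.read_csv(filename, header=0, compression="infer")
        # Record which file each row came from.
        df[filename_column] = os.path.basename(filename)
        frames.append(df)
    if not frames:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # Stack the per-file frames row-wise and renumber the index from 0.
    return pd.concat(frames, axis=0, ignore_index=True)
```

It would be called as, for example, frame = read_csvs_with_filename("your_path_here/*.csv.gz"). Note that this runs outside of Beam, so it does not parallelize the way a pipeline would.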
