
I'm trying to read multiple CSV files. Why does this code return <_ReadFromPandas(PTransform) label=[_ReadFromPandas]>? Here is the read_csv source: https://beam.apache.org/releases/pydoc/2.25.0/_modules/apache_beam/dataframe/io.html

pcol_of_dfs = (p 
    | 'Match files' >> beam.io.fileio.MatchFiles(path)
    | 'Read Files' >> beam.Map(lambda file_meta: beam.dataframe.io.read_csv(file_meta.path))
)

Ultimately, I want to read all the CSV files and append each file's name as an additional column.

I have several hundred gzipped CSV files in a GCS bucket. All of them have an identical set of columns, and all have headers. The CSV values may contain line breaks. File sizes range from a few KB to ~5 GB.

stkvtflw

1 Answer


Per this post, to read multiple CSV files into a single Pandas DataFrame, you can use the pandas.concat() function instead.

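import glob
import os

import pandas as pd

path = r'your_path_here'  # directory that contains the CSV files
all_files = glob.glob(os.path.join(path, "*.csv"))

# Read each file into its own DataFrame.
frames = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    frames.append(df)

# Stack the per-file DataFrames into one and renumber the index.
frame = pd.concat(frames, axis=0, ignore_index=True)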
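
Since the question also asks for the file name as an extra column, the same loop can tag each DataFrame before concatenating. This is only a minimal sketch under a few assumptions: the files are reachable through a local glob pattern (plain glob does not list gs:// paths), the helper name read_csvs_with_filename and the column name source_file are made up for illustration, and everything fits in memory, which may not hold for files of ~5 GB. pandas infers gzip from a .gz suffix and parses quoted line breaks inside values by default.

```python
import glob
import os

import pandas as pd


def read_csvs_with_filename(pattern, column="source_file"):
    """Read every CSV matching `pattern` and tag each row with its file name.

    `read_csvs_with_filename` and `source_file` are hypothetical names
    chosen for this sketch, not part of pandas.
    """
    frames = []
    for filename in sorted(glob.glob(pattern)):
        # compression="infer" (the pandas default) selects gzip for *.gz files.
        df = pd.read_csv(filename, header=0, compression="infer")
        # Record which file each row came from.
        df[column] = os.path.basename(filename)
        frames.append(df)
    if not frames:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # Stack all files into one DataFrame with a fresh index.
    return pd.concat(frames, axis=0, ignore_index=True)
```

For example, read_csvs_with_filename("your_path_here/*.csv.gz") would return one DataFrame with the original columns plus source_file.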
Joevanie