
I need to create a DataFrame with the pandas library from parquet files hosted in a Google Cloud Storage bucket. I have searched the documentation and online examples but can't figure out how to go about it.

Could you please assist me by pointing me in the right direction?

I am not looking for a solution but for a location where I could look for further information so that I could devise my own solution.

Thank you in advance.

User9102d82

2 Answers


You may use the gcsfs and pyarrow libraries to do so.

import gcsfs
from pyarrow import parquet

url = "gs://bucket_name/.../folder_name"
fs = gcsfs.GCSFileSystem()

# Assuming your parquet files start with the `part-` prefix
files = ["gs://" + path for path in fs.glob(url + "/part-*")]

# Read all matched files as one dataset and convert to a pandas DataFrame
ds = parquet.ParquetDataset(files, filesystem=fs)
df = ds.read().to_pandas()
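
ParquetDataset treats the matched files as a single logical dataset, so partitioned output (for example Spark-style part-* files) comes back as one DataFrame without a manual loop.
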
Terence
  • suggest to do `"/*.parquet"` instead of `"/part-*"` to be more explicit. – Shern Feb 25 '22 at 10:15
  • can you help with this one? https://stackoverflow.com/questions/75529064/how-to-load-multiple-partition-parquet-files-from-gcs-into-pandas-dataframe – Regressor Feb 22 '23 at 15:16

You can read it with pandas.read_parquet like this:

import pandas
df = pandas.read_parquet('gs://bucket_name/file_name')

Additionally, you will need the gcsfs library and either pyarrow or fastparquet installed.

Don't forget to provide credentials if you access a private bucket.
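
For example, a minimal sketch of passing credentials explicitly (the bucket path and key-file path below are placeholders; storage_options requires pandas 1.2+ and is forwarded to gcsfs):

import pandas

# Placeholder paths; gcsfs accepts a service-account key file as `token`.
df = pandas.read_parquet(
    'gs://bucket_name/file_name',
    storage_options={'token': '/path/to/service-account-key.json'},
)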

Emil Gi
  • Hi, thank you for your answer. It is close, but there is an issue: the method reads a single parquet file, agreed, but if a folder has multiple parquet files it doesn't work. Or is there some other option to be added? Basically I will not know whether there will be a single parquet file or multiple, and that is what I need to handle. – User9102d82 Feb 26 '20 at 11:11
  • You can get a list of files in the bucket and then iterate over it with a loop, reading the files one by one (see the sketch below). Refer to [this question](https://stackoverflow.com/q/54988092/12232507) for an example. I don't think there is a method to read the entire bucket at once. – Emil Gi Feb 26 '20 at 11:57
  • I would recommend you kindly update your last comment as an answer and I shall accept it, as there is no other alternative. – User9102d82 Mar 30 '20 at 09:59
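
For the multi-file case discussed in the comments, a minimal sketch of that list-and-loop approach (the bucket and folder names are placeholders):

import gcsfs
import pandas

fs = gcsfs.GCSFileSystem()

# List every parquet file under the folder; glob returns paths without the
# gs:// scheme, so it is added back before reading.
paths = fs.glob('gs://bucket_name/folder_name/*.parquet')
frames = [pandas.read_parquet('gs://' + p) for p in paths]
df = pandas.concat(frames, ignore_index=True)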