
I have a Python object that I know holds the contents of a Parquet file. (I have no way to read it from an actual file.)

The object var_1 contains b'PAR1\x15\x....1\x00PAR1

When I check the type:

type(var_1)

the result is bytes.

Is there a way to read this, say into a pandas DataFrame?

I have tried:

1)

from fastparquet import ParquetFile
pf = ParquetFile(var_1)

And got:

TypeError: a bytes-like object is required, not 'str'

2)

import pyarrow.parquet as pq
dataset = pq.ParquetDataset(var_1)

And got:

TypeError: not a path-like object

Note: the solution from "How to read a Parquet file into Pandas DataFrame?", i.e. pd.read_parquet(var_1, engine='fastparquet'), results in TypeError: a bytes-like object is required, not 'str'

Asclepius
AnarKi
  • Possible duplicate of [How to read a Parquet file into Pandas DataFrame?](https://stackoverflow.com/questions/33813815/how-to-read-a-parquet-file-into-pandas-dataframe) – PV8 Sep 23 '19 at 11:24
  • No, the solution for that question, i.e. `pd.read_parquet(var_1, engine='fastparquet')` results in `TypeError: a bytes-like object is required, not 'str'` – AnarKi Sep 23 '19 at 11:25
  • `TypeError: a bytes-like object is required, not 'str'` is telling you that your `var_1` value is of string type. – monkut Sep 23 '19 at 11:32
  • `TypeError: not a path-like object` is telling you you need a file path like, `from pathlib import Path;myparquet_filepath = Path('/path/to/file')` – monkut Sep 23 '19 at 11:34
  • @monkut As I mentioned, when I check the type it says `bytes`, which is what confuses me. I even tried `pd.read_parquet(bytes(var_1), engine='fastparquet')`. As for your other point, Parquet data is usually stored in one or more files, so the functions that read it expect a path to the file or files, which is not my case. – AnarKi Sep 23 '19 at 11:46
  • Can you share more of the code section, it might help us to help you more easily. – monkut Sep 23 '19 at 11:49
  • So for the _pyarrow_ implementation you should give it a directory path, `The ParquetDataset class accepts either a directory name or a list or file paths,`. If you don't have the path how are you initially reading in the data to get the bytes? – monkut Sep 23 '19 at 11:51

2 Answers


This was tested with Pandas 1.2.3

To read a parquet bytes object as a Pandas dataframe:

import io

import pandas as pd

pq_bytes = b'PAR1\x15\x02\x19\x1c5\x00\x18\x06schema\x15\x00\x00\x16\x00\x19\x1c\x19\x0c\x16\x00\x16\x00&\x00\x16\x00\x14\x00\x00\x19,\x18\x06pandas\x18\x8c\x01{"index_columns": [], "column_indexes": [], "columns": [], "creator": {"library": "pyarrow", "version": "1.0.1"}, "pandas_version": "1.1.3"}\x00\x18\x0cARROW:schema\x18\xd8\x02//////gAAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAMQAAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAACcAAAABAAAAIwAAAB7ImluZGV4X2NvbHVtbnMiOiBbXSwgImNvbHVtbl9pbmRleGVzIjogW10sICJjb2x1bW5zIjogW10sICJjcmVhdG9yIjogeyJsaWJyYXJ5IjogInB5YXJyb3ciLCAidmVyc2lvbiI6ICIxLjAuMSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMS4xLjMifQAAAAAGAAAAcGFuZGFzAAAAAAAAAAAAAA==\x00\x18"parquet-cpp version 1.5.1-SNAPSHOT\x19\x0c\x00M\x02\x00\x00PAR1'
pq_file = io.BytesIO(pq_bytes)
df = pd.read_parquet(pq_file)
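
Before wrapping the bytes, it can help to confirm that they really look like a Parquet file. A standard (unencrypted) Parquet file begins and ends with the 4-byte magic b'PAR1', which matches the value shown in the question. The helper below is only a sketch: the name looks_like_parquet is hypothetical, and it checks the magic markers alone, not whether the footer is valid.

```python
def looks_like_parquet(data: bytes) -> bool:
    """Cheap sanity check: Parquet magic at both ends of the buffer.

    Hypothetical helper for illustration. It does not parse the footer,
    so a True result only means the bytes are plausibly Parquet.
    """
    magic = b"PAR1"
    # Smallest possible layout: leading magic, 4-byte footer length, trailing magic.
    if len(data) < 12:
        return False
    return data[:4] == magic and data[-4:] == magic


if __name__ == "__main__":
    print(looks_like_parquet(b"PAR1" + b"\x00" * 8 + b"PAR1"))
    print(looks_like_parquet(b"not parquet at all"))
```

If this returns False for var_1, the bytes were probably truncated or transformed (for example base64-encoded) somewhere upstream, and no reader will accept them.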

To write a Pandas dataframe to a bytes object:

import pandas as pd

df = pd.DataFrame()
df.to_parquet()
b'PAR1\x15\x04\x15\x00\x15\x02L\x15\x00\x15\x04\x12\x00\x00\x00&&\x1c\x15\x02\x195\x04\x00\x06\x19\x18\x11__index_level_0__\x15\x02\x16\x00\x16\x1c\x16\x1e&\x00&\x08)\x1c\x15\x04\x15\x04\x15\x02\x00\x00\x00\x15\x02\x19,5\x00\x18\x06schema\x15\x02\x00\x15\x02%\x02\x18\x11__index_level_0__l\xbc\x00\x00\x00\x16\x00\x19\x1c\x19\x1c&&\x1c\x15\x02\x195\x04\x00\x06\x19\x18\x11__index_level_0__\x15\x02\x16\x00\x16\x1c\x16\x1e&\x00&\x08)\x1c\x15\x04\x15\x04\x15\x02\x00\x00\x00\x16\x1e\x16\x00&&\x16\x1e\x14\x00\x00\x19,\x18\x06pandas\x18\xf6\x02{"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "empty", "numpy_type": "object", "metadata": null}], "columns": [{"name": null, "field_name": "__index_level_0__", "pandas_type": "empty", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "3.0.0"}, "pandas_version": "1.2.3"}\x00\x18\x0cARROW:schema\x18\xec\x05/////ygCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAKwBAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAACEAQAABAAAAHYBAAB7ImluZGV4X2NvbHVtbnMiOiBbIl9faW5kZXhfbGV2ZWxfMF9fIl0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJlbXB0eSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogbnVsbCwgImZpZWxkX25hbWUiOiAiX19pbmRleF9sZXZlbF8wX18iLCAicGFuZGFzX3R5cGUiOiAiZW1wdHkiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfV0sICJjcmVhdG9yIjogeyJsaWJyYXJ5IjogInB5YXJyb3ciLCAidmVyc2lvbiI6ICIzLjAuMCJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMS4yLjMifQAABgAAAHBhbmRhcwAAAQAAABQAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQEQAAAAKAAAAAQAAAAAAAAAEQAAAF9faW5kZXhfbGV2ZWxfMF9fAAAABAAEAAQAAAA=\x00\x18"parquet-cpp version 1.5.1-SNAPSHOT\x19\x1c\x1c\x00\x00\x00\x1f\x05\x00\x00PAR1'
Asclepius

You can do this by wrapping the bytes object in a pyarrow.BufferReader.

import pyarrow as pa
import pyarrow.parquet as pq

var_1 = …    
reader = pa.BufferReader(var_1)
table = pq.read_table(reader)
df = table.to_pandas()  # This results in a pandas.DataFrame
Uwe L. Korn