Write JSON to parquet file using pyarrow

Question

I'm running the following code:

import pyarrow
import pyarrow.parquet as pq
import pandas as pd
import json

parquet_schema = schema = pyarrow.schema(
    [('id', pyarrow.string()),
     ('firstname', pyarrow.string()),
     ('lastname', pyarrow.string())])



user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'

writer = pq.ParquetWriter('user.parquet', schema=parquet_schema)

df = pd.DataFrame.from_dict(json.loads(user_json))
table = pyarrow.Table.from_pandas(df)
print(table.schema)
writer.write_table(table)
writer.close()

but I'm getting the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-a427a4cdd392> in <module>()
     15 writer = pq.ParquetWriter('user.parquet', schema=parquet_schema)
     16 
---> 17 df = pd.DataFrame.from_dict(json.loads(user_json))
     18 table = pyarrow.Table.from_pandas(df)
     19 print(table.schema)

4 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in extract_index(data)
    385 
    386         if not indexes and not raw_lengths:
--> 387             raise ValueError("If using all scalar values, you must pass an index")
    388 
    389         if have_series:

ValueError: If using all scalar values, you must pass an index

I followed the docs and tutorials, but I'm missing something.

pyarrow and pandas work on batches of records rather than record by record. If you only have one record, put it in a list: `pd.DataFrame.from_dict([json.loads(user_json)])`. It will work, but it won't be very efficient and it defeats the purpose of pyarrow/pandas. — 0x26res, Aug 31 '21 at 13:50

score 2 · Answer 1 · answered Aug 30 '21 at 08:32

Given that you are working with columnar data, the libraries you use expect you to pass, for each column, the list of values across all rows.

I guess you aren't going to write a parquet file with a single row in real life. In that case you can just group the values by column, and that will work with both pandas and arrow.

You can also avoid pandas entirely and go through the from_pydict method of pyarrow.Table:

import pyarrow
import pyarrow.parquet as pq

users = {"id" : ["id1", "id2"], 
         "firstname": ["John", "Jack"], 
         "lastname": ["Doe", "Ryan"]}

table = pyarrow.Table.from_pydict(users)
print(table.schema)

with pq.ParquetWriter('user.parquet', schema=table.schema) as writer:
    writer.write_table(table)

See https://arrow.apache.org/cookbook/py/create.html#create-table-from-plain-types and https://arrow.apache.org/cookbook/py/io.html#write-a-parquet-file

score 0 · Answer 2 · answered Aug 29 '21 at 19:50

You have three options:

Stop using scalar values and have the values of your dict (from a json string) be lists.

import pyarrow
import pyarrow.parquet as pq
import pandas as pd
import json


user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'
user_dict = json.loads(user_json)

# Make all values in the dict a list
for key, value in user_dict.items():
    user_dict[key] = [value]
df = pd.DataFrame(user_dict)

df.to_parquet('myfile.parquet')

or

Simply pass an index when loading scalar values (e.g. when a value is 2 rather than [2])

import pyarrow
import pyarrow.parquet as pq
import pandas as pd
import json


user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'
user_dict = json.loads(user_json)

# Pass an index instead
df = pd.DataFrame(user_dict, index=[0])
df.to_parquet('myfile.parquet')

or

Use `DataFrame.from_records`

import pyarrow
import pyarrow.parquet as pq
import pandas as pd
import json


user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'
user_dict = json.loads(user_json)

# `DataFrame.from_records` expects a sequence of records, so wrap the single dict in a list
df = pd.DataFrame.from_records([user_dict])
df.to_parquet('myfile.parquet')

The 3rd is the simplest, but I'd probably avoid getting into the habit of passing scalar values into a DataFrame and use the 1st option instead.

Read up more on the scalar issues at Constructing pandas DataFrame from values in variables gives "ValueError: If using all scalar values, you must pass an index"
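As a quick sanity check, the following sketch reproduces the error from the question and then builds the same one-row frame in the three ways described in this answer. The variable names are mine, and note that the `from_records` call is given a list containing the dict.

```python
import json

import pandas as pd

user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'
user_dict = json.loads(user_json)

# A dict of bare scalars has no row dimension, so pandas refuses to build a frame.
try:
    pd.DataFrame(user_dict)
    raised = False
except ValueError:
    raised = True

# Option 1: turn every value into a one-element list.
by_lists = pd.DataFrame({key: [value] for key, value in user_dict.items()})

# Option 2: keep the scalars and supply the index explicitly.
by_index = pd.DataFrame(user_dict, index=[0])

# Option 3: treat the dict as a single record in a list of records.
by_records = pd.DataFrame.from_records([user_dict])

# Each frame holds exactly one row with the original values.
rows = [frame.to_dict('records') for frame in (by_lists, by_index, by_records)]
```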
