3

I have a Pandas dataframe with a column containing dictionary values. I'd like to query this dataframe using DuckDB and convert the result to another dataframe, and have the type preserved across the query.

DuckDB has the MAP data type which looks like a good match for a dictionary, but when selecting the column it's turned into a VARCHAR, and results in a string-type column if I convert back to a dataframe.

Is there some way to preserve the type, or at least a good way to convert the string back to a dictionary when generating the new dataframe?

>>> # Create a dataframe with a column containing a dictionary
>>> df = pd.DataFrame([[{'some': 'dict', 'with': 'stuff'}]], columns=['mycol'])
>>> df
                               mycol
0  {'some': 'dict', 'with': 'stuff'}
>>> type(df['mycol'][0])
<class 'dict'>

>>> # Select that column using DuckDB - it becomes a VARCHAR
>>> duckdb.query('select mycol from df')
---------------------
-- Expression Tree --
---------------------
Subquery

---------------------
-- Result Columns  --
---------------------
- mycol (VARCHAR)

---------------------
-- Result Preview  --
---------------------
mycol
VARCHAR
[ Rows: 1]
{'some': 'dict', 'with': 'stuff'}


>>> # Converting the query result to another dataframe results in a string-type column
>>> df2 = duckdb.query('select mycol from df').to_df()
>>> df2
                               mycol
0  {'some': 'dict', 'with': 'stuff'}
>>> type(df2['mycol'][0])
<class 'str'>

>>> # An explicit cast to MAP doesn't work
>>> duckdb.query('select CAST(mycol as MAP(VARCHAR, VARCHAR)) from df')
---------------------
-- Expression Tree --
---------------------
Subquery

---------------------
-- Result Columns  --
---------------------
- CAST(mycol AS MAP<VARCHAR, VARCHAR>) (MAP<VARCHAR, VARCHAR>)

---------------------
-- Result Preview  --
---------------------
Conversion Error: Conversion Error: Unimplemented type for cast (VARCHAR -> MAP<VARCHAR, VARCHAR>)
Aron
  • 1,552
  • 1
  • 13
  • 34

1 Answers1

3

this is a known issue in DuckDB (https://github.com/duckdb/duckdb/issues/2450).

You can workaround this issue by using PyArrow.

import pyarrow as pa
import pandas as pd
import duckdb

df = pd.DataFrame([[{'some': 'dict', 'with': 'stuff'}]], columns=['mycol'])
con = duckdb.connect()
arrow_table = pa.Table.from_pandas(df)

con.execute("select * from arrow_table").fetchall()

Pedro Holanda
  • 261
  • 1
  • 3
  • This is a good suggestion, infact I was trying to learn duckdb and after stuggling with the documentation of I just completely switched to pyarrow. – Canute S Aug 01 '23 at 09:33