DASK: merge throws error when one side's key is NA whereas pd.merge works

Question

I have these sample dataframes:

tdf1 = pd.DataFrame([{"id": 1, "val": 4}, {"id": 2, "val": 5}, {"id": 3, "val": 6}, {"id": pd.NA, "val": 7}, {"id": 4, "val": 8}])
tdf2 = pd.DataFrame([{"some_id": 1, "name": "Josh"}, {"some_id": 3, "name": "Jake"}])

pd.merge(tdf1, tdf2, how="left", left_on="id", right_on="some_id").head()

And the merge works perfectly. Now if I want to the same using Dask:

dd_tdf1 = dd.from_pandas(tdf1, npartitions=10)
dd_tdf2 = dd.from_pandas(tdf2, npartitions=10)

dd_tdf2.merge(dd_tdf1, left_on="some_id", right_on="id", how="right", npartitions=10).compute(scheduler="threads").head()

I get the following error:

File /opt/conda/lib/python3.10/site-packages/pandas/core/reshape/merge.py:1585, in <genexpr>(.0)
   1581         return _get_no_sort_one_missing_indexer(left_n, False)
   1583 # get left & right join labels and num. of levels at each location
   1584 mapped = (
-> 1585     _factorize_keys(left_keys[n], right_keys[n], sort=sort, how=how)
   1586     for n in range(len(left_keys))
   1587 )
   1588 zipped = zip(*mapped)
   1589 llab, rlab, shape = (list(x) for x in zipped)

File /opt/conda/lib/python3.10/site-packages/pandas/core/reshape/merge.py:2313, in _factorize_keys(lk, rk, sort, how)
   2309 if is_integer_dtype(lk.dtype) and is_integer_dtype(rk.dtype):
   2310     # GH#23917 TODO: needs tests for case where lk is integer-dtype
   2311     #  and rk is datetime-dtype
   2312     klass = libhashtable.Int64Factorizer
-> 2313     lk = ensure_int64(np.asarray(lk))
   2314     rk = ensure_int64(np.asarray(rk))
   2316 elif needs_i8_conversion(lk.dtype) and is_dtype_equal(lk.dtype, rk.dtype):
   2317     # GH#23917 TODO: Needs tests for non-matching dtypes

File pandas/_libs/algos_common_helper.pxi:86, in pandas._libs.algos.ensure_int64()

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NAType'

Does anyone has an idea of why this might be happening?

can you show the full traceback? it looks like dask is casting the left side's `id` column as the same type as the right side's `some_id` column, since you've specified a right join, and this is then causing problems when trying to cast non-int values as int. you could start by just dropping NaNs on the left column... any reason you can't do that? — Michael Delgado, Oct 28 '22 at 16:57
pandas silently handles issues like this because it's working with the data in memory. all that dask sees is metadata, so it just knows that you want to merge two dataframes, with `join=right; right_index:int; left_index:object`, so it schedules the op to cast the left df's `id` column as an int and then when you get to actually doing the computation you run into trouble. in general, dask is pickier about types because it has to allocate the arrays on remote workers without seeing the data. — Michael Delgado, Oct 28 '22 at 17:01
thanks for the edit! when editing, go ahead and just change the question to clarify/streamline the question rather than leaving multiple "Edit" blocks. the goal is to have a clean, clear question & answer for others coming across the post. think wikipedia, not a blog. if you ever want to see the edit history, you can just click on the ["edited" link](/stackoverflow.com/posts/74237866/revisions) (above the last editor's profile badge). Also, I know it's *super* long, but always include the *full traceback* when asking about errors - all the way from `Traceback (most recent call last)`. Thanks! — Michael Delgado, Oct 28 '22 at 18:32

score 1 · Accepted Answer · answered Oct 28 '22 at 17:42

1

This expands on the comments by @Michael Delgado. The tricky thing is that int64 does not support the pd.NA. You can see that this error is raised by pandas via trying:

tdf1['id'] = tdf1['id'].astype('int64')
# TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NAType'

One fix is to make sure that the dtype of the id is set to Int64 (which supports pd.NA):

tdf1['id'] = tdf1['id'].astype('Int64')
tdf2['some_id'] = tdf2['some_id'].astype('Int64')

Once the dtypes are the same, you can run the dask snippet and obtain the desired merge.

answered Oct 28 '22 at 17:42

SultanOrazbayev

14,900
3
16
46

1

This actually was the solution, both set to Int64. Thanks for the answer and for your comments as well @MichaelDelgado, it was really helpful to understand the way Dask works. – Jorge Cespedes Oct 28 '22 at 17:59
1

nice! I always forget about the `Int` types :P – Michael Delgado Oct 28 '22 at 18:26

DASK: merge throws error when one side's key is NA whereas pd.merge works

1 Answers1