Question
How can I avoid triggering VisibleDeprecationWarning when splitting a column containing nested ragged arrays into new columns in a pandas DataFrame? A commonly accepted and straightforward way, or an explanation of why it is currently impossible, would be appreciated.
Terminology:
- Array in this post refers to numpy.array.
- Ragged means the items in "a collection of list-like objects" (e.g., list[list], array[array], list[tuple]) have unequal numbers of elements.
- A "nested ragged array" in this post means the array is ragged at the deepest level, but the number of elements is equal within each of the two outermost levels. In other words, it can be converted into a 2D array which contains list-like objects with potentially unequal lengths.
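To make the last definition concrete, here is a minimal sketch of converting such data into a 2D object array. It assumes the outer shape is known up front and fills the array cell by cell, so NumPy never has to infer a shape from the ragged innermost level. The names `nested` and `arr` are illustrative only.

```python
import numpy as np

# Sample nested ragged data: 3 rows, 2 items per row, inner lengths differ.
nested = [
    [[1, 2], [3, 4, 5]],
    [[6], [7, 8, 9]],
    [[10, 11, 12], []],
]

# Allocate the 2D object array first, then fill it one cell at a time.
# Each cell holds the original Python list, whatever its length.
arr = np.empty((len(nested), len(nested[0])), dtype=object)
for i, row in enumerate(nested):
    for j, item in enumerate(row):
        arr[i, j] = item

print(arr.shape)  # (3, 2)
print(arr[0, 1])  # [3, 4, 5]
```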
Survey of Existing Posts
I could not find a commonly accepted and straightforward way after an extensive survey as well as experiments done by myself. The two most relevant posts on SO at the time of posting are listed below.
- This post suppresses the warning but does not point out what a good implementation should be. Suppressing a warning is, of course, not generally considered good practice.
- This post also focuses on the warning messages rather than the implementation.
These posts are at best marginally related to the problem: post1, post2, post3, post4.
Experiments
Sample data and expected output
import pandas as pd

df = pd.DataFrame(
    data={
        "id": ['a', 'b', 'c'],
        "col1": [[[1, 2], [3, 4, 5]],
                 [[6], [7, 8, 9]],
                 [[10, 11, 12], []]
                 ]
    }
)
df
Out[81]:
id col1
0 a [[1, 2], [3, 4, 5]]
1 b [[6], [7, 8, 9]]
2 c [[10, 11, 12], []]
One can see that df["col1"] has shape=(3, 2) for the two outermost levels. The expected output:
df # expected output
Out[177]:
id col1 sep1 sep2
0 a [[1, 2], [3, 4, 5]] [1, 2] [3, 4, 5]
1 b [[6], [7, 8, 9]] [6] [7, 8, 9]
2 c [[10, 11, 12], []] [10, 11, 12] []
To save time, one can skip to the last subsection to begin with the working method directly. All relevant strategies I have tried are presented in chronological order below.
Main trial
The splitting function here produces a pd.Series of two-element tuples, which is reasonable.
df["col1"].apply(lambda el: (el[0], el[1]))
Out[82]:
0 ([1, 2], [3, 4, 5])
1 ([6], [7, 8, 9])
2 ([10, 11, 12], [])
Name: col1, dtype: object
However, direct assignment into separate columns raises a ValueError.
df[["sep1", "sep2"]] = df["col1"].apply(lambda el: (el[0], el[1]))
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3417, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-75-973c44fe294a>", line 1, in <module>
df[["sep1", "sep2"]] = df["col1"].apply(lambda el: (el[0], el[1]))
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 3037, in __setitem__
self._setitem_array(key, value)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 3072, in _setitem_array
self.iloc._setitem_with_indexer((slice(None), indexer), value)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1755, in _setitem_with_indexer
"Must have equal len keys and value "
ValueError: Must have equal len keys and value when setting with an iterable
This can be avoided by converting the Series to a list using .tolist().
df["col1"].apply(lambda el: (el[0], el[1])).tolist()
Out[84]: [([1, 2], [3, 4, 5]), ([6], [7, 8, 9]), ([10, 11, 12], [])]
Now the direct assignment works correctly, but a VisibleDeprecationWarning is emitted.
df[["sep1", "sep2"]] = df["col1"].apply(lambda el: (el[0], el[1])).tolist()
/opt/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order)
df # this is expected
Out[86]:
id col1 sep1 sep2
0 a [[1, 2], [3, 4, 5]] [1, 2] [3, 4, 5]
1 b [[6], [7, 8, 9]] [6] [7, 8, 9]
2 c [[10, 11, 12], []] [10, 11, 12] []
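The warning text comes from NumPy rather than pandas, so the same situation can be reproduced without a DataFrame. The sketch below is only an illustration: it assumes pandas hands the list of tuples to NumPy without a dtype somewhere along this assignment path, which the traceback into numpy/core/_asarray.py suggests but which I have not confirmed in the pandas source. The helper `implicit_conversion_outcome` is a made-up name. What it reports depends on the installed NumPy version: older releases warn, and newer ones raise a ValueError instead.

```python
import warnings
import numpy as np

rows = [([1, 2], [3, 4, 5]), ([6], [7, 8, 9]), ([10, 11, 12], [])]

def implicit_conversion_outcome(data):
    """Describe what np.array(data) does when no dtype is given."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        try:
            np.array(data)
        except ValueError:
            # Newer NumPy versions refuse ragged input outright.
            return "ValueError"
    return caught[0].category.__name__ if caught else "no warning"

outcome = implicit_conversion_outcome(rows)
print(outcome)  # version dependent

# Asking for dtype=object explicitly, as the warning message suggests,
# keeps the two regular outer levels and stores the ragged lists as objects.
explicit = np.array(rows, dtype=object)
print(explicit.shape)  # (3, 2)
```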
list-zip-star method
This approach ends in either a ValueError or a VisibleDeprecationWarning, depending on how the columns are assigned.
ls = df["col1"].apply(lambda el: (el[0], el[1])).tolist()
unpacked = list(zip(*ls))
df[["sep1", "sep2"]] = unpacked
# same ValueError message as above
df["sep1"] = unpacked[0]
# same VisibleDeprecationWarning message as above
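One possible variation on this method, offered as a sketch rather than a confirmed fix, is to wrap each tuple produced by zip in a pd.Series before assigning it. The assumption here is that the Series constructor builds a one-dimensional object array from a list of lists without asking NumPy to infer a shape. Whether it silences the warning on the pandas and NumPy versions used above is untested.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "col1": [[[1, 2], [3, 4, 5]], [[6], [7, 8, 9]], [[10, 11, 12], []]],
    }
)

ls = df["col1"].apply(lambda el: (el[0], el[1])).tolist()
unpacked = list(zip(*ls))  # one tuple of lists per new column

for name, values in zip(["sep1", "sep2"], unpacked):
    # Convert the tuple to a list and wrap it in a Series that shares
    # df's index, so each inner list becomes a single cell.
    df[name] = pd.Series(list(values), index=df.index)

print(df)
```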
list-map-list-zip-star method (works but...)
Adding one more layer of list-map, found more or less by trial and error, finally gives the desired output. But this is counter-intuitive in the following ways:
- The new columns must be assigned individually. Why can't it be done at once?
- The list-map-list-zip-star expression is extremely convoluted.
Am I really supposed to do this by design?
ls = df["col1"].apply(lambda el: (el[0], el[1])).tolist()
unpacked = list(map(list, zip(*ls))) # a magical spell
df[["sep1", "sep2"]] = unpacked
# same ValueError message. Why?
# set the new columns individually.
df["sep1"] = unpacked[0]
df["sep2"] = unpacked[1]
df # expected output
Out[177]:
id col1 sep1 sep2
0 a [[1, 2], [3, 4, 5]] [1, 2] [3, 4, 5]
1 b [[6], [7, 8, 9]] [6] [7, 8, 9]
2 c [[10, 11, 12], []] [10, 11, 12] []
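For comparison, here is a sketch of a different route that creates both columns in one step: build a second DataFrame from the list of rows and join it back on the index. It assumes every row of col1 holds exactly two items, and it relies on the DataFrame constructor treating each inner list as a single cell. The names `parts` and `result` are illustrative. I have not verified this against the exact pandas and NumPy versions shown in the tracebacks above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "col1": [[[1, 2], [3, 4, 5]], [[6], [7, 8, 9]], [[10, 11, 12], []]],
    }
)

# Each row of col1 becomes one row of `parts`; each of its two items
# becomes one cell. Reusing df.index keeps the rows aligned for the join.
parts = pd.DataFrame(
    df["col1"].tolist(), index=df.index, columns=["sep1", "sep2"]
)

# join returns a new frame and leaves df itself unchanged.
result = df.join(parts)
print(result)
```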