Question

How can I avoid triggering VisibleDeprecationWarning when splitting a column that contains nested ragged arrays into new columns of a pandas DataFrame? A commonly accepted, straightforward method, or an explanation of why this is currently impossible, would be appreciated.

Terminology:

  • Array in this post refers to numpy.array.
  • Ragged means that the items in "a collection of list-like objects" (i.e., list[list], array[array], list[tuple], etc.) have unequal numbers of elements.
  • A "nested ragged array" in this post means an array that is ragged at the deepest level but has a consistent number of elements at each of the two outermost levels. In other words, it can be converted into a 2D array whose cells are list-like objects of potentially unequal lengths.
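
To make the last definition concrete, here is a minimal sketch in plain Python. The helper name is_nested_ragged is hypothetical (it is not part of NumPy or pandas), and it only compares list lengths.

```python
def is_nested_ragged(rows):
    """Return True if `rows` is rectangular in its two outermost levels
    but its innermost list-likes have differing lengths."""
    # Every row must hold the same number of items (rectangular 2D layout).
    widths = {len(row) for row in rows}
    # The innermost items may differ in length; that is the "ragged" part.
    inner_lengths = {len(item) for row in rows for item in row}
    return len(widths) == 1 and len(inner_lengths) > 1


sample = [[[1, 2], [3, 4, 5]], [[6], [7, 8, 9]], [[10, 11, 12], []]]
print(is_nested_ragged(sample))                     # True: 3 rows x 2 items, ragged inside
print(is_nested_ragged([[[1], [2]], [[3], [4]]]))   # False: nothing is ragged
print(is_nested_ragged([[[1]], [[2], [3, 4]]]))     # False: rows differ in width
```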

Survey of Existing Posts

I could not find a commonly accepted and straightforward way despite an extensive search and my own experiments. The two most relevant posts on SO at the time of posting are listed below.

  • This post suppresses the warning but does not say what a good implementation would look like. Suppressing warnings is, of course, not generally considered good practice.
  • This post also focuses on the warning message rather than the implementation.

These posts are at best marginally related to the problem: post1, post2, post3, post4.

Experiments

Sample data and expected output

import pandas as pd

df = pd.DataFrame(
    data={
        "id": ['a', 'b', 'c'],
        "col1": [[[1, 2], [3, 4, 5]],
                 [[6], [7, 8, 9]],
                 [[10, 11, 12], []]
                 ]
    }
)

df
Out[81]: 
  id                 col1
0  a  [[1, 2], [3, 4, 5]]
1  b     [[6], [7, 8, 9]]
2  c   [[10, 11, 12], []]

One can see that df["col1"] has shape=(3, 2) for the two outermost levels. The expected output:

df  # expected output  
Out[177]: 
  id                 col1          sep1       sep2
0  a  [[1, 2], [3, 4, 5]]        [1, 2]  [3, 4, 5]
1  b     [[6], [7, 8, 9]]           [6]  [7, 8, 9]
2  c   [[10, 11, 12], []]  [10, 11, 12]         []

To save time, skip to the last subsection, which contains the working method. All relevant strategies I tried are presented below in chronological order.

Main trial

The splitting function here produces a pd.Series of two-element tuples, which is reasonable.

df["col1"].apply(lambda el: (el[0], el[1]))
Out[82]: 
0    ([1, 2], [3, 4, 5])
1       ([6], [7, 8, 9])
2     ([10, 11, 12], [])
Name: col1, dtype: object

However, direct assignment into separate columns raises a ValueError.

df[["sep1", "sep2"]] = df["col1"].apply(lambda el: (el[0], el[1]))

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3417, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-75-973c44fe294a>", line 1, in <module>
    df[["sep1", "sep2"]] = df["col1"].apply(lambda el: (el[0], el[1]))
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 3037, in __setitem__
    self._setitem_array(key, value)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 3072, in _setitem_array
    self.iloc._setitem_with_indexer((slice(None), indexer), value)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1755, in _setitem_with_indexer
    "Must have equal len keys and value "
ValueError: Must have equal len keys and value when setting with an iterable

This can be avoided by casting the Series to a list using .tolist().

df["col1"].apply(lambda el: (el[0], el[1])).tolist()
Out[84]: [([1, 2], [3, 4, 5]), ([6], [7, 8, 9]), ([10, 11, 12], [])]

Now the direct assignment works correctly, but a VisibleDeprecationWarning is emitted.

df[["sep1", "sep2"]] = df["col1"].apply(lambda el: (el[0], el[1])).tolist()

/opt/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)

df  # this is expected
Out[86]: 
  id                 col1          sep1       sep2
0  a  [[1, 2], [3, 4, 5]]        [1, 2]  [3, 4, 5]
1  b     [[6], [7, 8, 9]]           [6]  [7, 8, 9]
2  c   [[10, 11, 12], []]  [10, 11, 12]         []
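
One possible way to sidestep the conversion that emits the warning is to build each new column as a Series, for example with the .str[i] accessor, which also indexes into list elements. This is only a sketch, assuming every cell of col1 holds exactly two items; it is not necessarily the canonical approach.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "col1": [[[1, 2], [3, 4, 5]], [[6], [7, 8, 9]], [[10, 11, 12], []]],
})

# .str[i] picks element i of each cell and returns an object-dtype Series.
# Each new column is assigned on its own, as an index-aligned Series.
for i, name in enumerate(["sep1", "sep2"]):
    df[name] = df["col1"].str[i]

print(df)
```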

list-zip-star method

This produces either a ValueError or a VisibleDeprecationWarning, depending on how the columns are assigned.

ls = df["col1"].apply(lambda el: (el[0], el[1])).tolist()
unpacked = list(zip(*ls))

df[["sep1", "sep2"]] = unpacked
# same ValueError message as above

df["sep1"] = unpacked[0]
# same VisibleDeprecationWarning message as above

list-map-list-zip-star method (works but...)

Adding another list-map layer, more or less by trial and error, finally produces the desired output. But this is counter-intuitive in the following ways:

  1. The new columns must be assigned individually. Why can't they be assigned at once?
  2. The list-map-list-zip-star expression is very hard to read and reason about.

Am I really supposed to do this by design?

ls = df["col1"].apply(lambda el: (el[0], el[1])).tolist()
unpacked = list(map(list, zip(*ls)))  # a magical spell

df[["sep1", "sep2"]] = unpacked
# same ValueError message. Why?

# set the new columns individually.
df["sep1"] = unpacked[0]
df["sep2"] = unpacked[1]

df  # expected output  
Out[177]: 
  id                 col1          sep1       sep2
0  a  [[1, 2], [3, 4, 5]]        [1, 2]  [3, 4, 5]
1  b     [[6], [7, 8, 9]]           [6]  [7, 8, 9]
2  c   [[10, 11, 12], []]  [10, 11, 12]         []
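
For reference, here is a plain-Python sketch of what the two idioms above return. It suggests that the only difference between them is the container type of each column (tuple versus list), which pandas apparently handles differently on assignment in the version used here. The underlying reason is not established in this post.

```python
ls = [([1, 2], [3, 4, 5]), ([6], [7, 8, 9]), ([10, 11, 12], [])]

# zip(*ls) transposes rows into columns; each column comes back as a tuple.
as_tuples = list(zip(*ls))
# map(list, ...) only changes the container type of each column to list.
as_lists = list(map(list, zip(*ls)))

print(as_tuples[0])  # ([1, 2], [6], [10, 11, 12])
print(as_lists[0])   # [[1, 2], [6], [10, 11, 12]]
```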
Bill Huang

2 Answers

Why not give the DataFrame constructor a try:

df = df.join(pd.DataFrame(df["col1"].apply(lambda el: (el[0], el[1])).tolist(),
                          index=df.index,
                          columns=["sep1", "sep2"]))
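
To check whether this route stays quiet, one can record warnings while running it. The snippet below is a sketch that rebuilds the sample frame from the question; filtering on the word "ragged" in the message is an assumption about how to recognise the warning, not an official API.

```python
import warnings

import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "col1": [[[1, 2], [3, 4, 5]], [[6], [7, 8, 9]], [[10, 11, 12], []]],
})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, even repeated ones
    parts = pd.DataFrame(df["col1"].tolist(), index=df.index,
                         columns=["sep1", "sep2"])
    result = df.join(parts)

# Keep only the warnings whose message mentions ragged sequences.
ragged_warnings = [w for w in caught if "ragged" in str(w.message)]
print(len(ragged_warnings))
print(result)
```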
BENY

How about:

df.join(pd.DataFrame(df['col1'].to_list(),
                     columns=['sep1', 'sep2'], index=df.index))

Output:

  id                 col1          sep1       sep2
0  a  [[1, 2], [3, 4, 5]]        [1, 2]  [3, 4, 5]
1  b     [[6], [7, 8, 9]]           [6]  [7, 8, 9]
2  c   [[10, 11, 12], []]  [10, 11, 12]         []
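
If the same split is needed in several places, the constructor-based approach can be wrapped in a small function. The sketch below is an illustration only: split_list_column and its parameter names are hypothetical, and the length check is an added assumption about how malformed rows should be handled.

```python
import pandas as pd


def split_list_column(df, column, names):
    """Hypothetical helper: spread the items of `column` over new columns `names`."""
    cells = df[column].tolist()
    # Fail early with a clear message if a cell does not hold len(names) items.
    bad = [i for i, cell in zip(df.index, cells) if len(cell) != len(names)]
    if bad:
        raise ValueError(f"rows {bad} do not contain exactly {len(names)} items")
    # Each cell becomes one row; its items become the new columns.
    parts = pd.DataFrame(cells, index=df.index, columns=names)
    return df.join(parts)


df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "col1": [[[1, 2], [3, 4, 5]], [[6], [7, 8, 9]], [[10, 11, 12], []]],
})
out = split_list_column(df, "col1", ["sep1", "sep2"])
print(out)
```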
Quang Hoang