
I am applying a function to a Pandas DataFrame column. The function returns a tuple, which I unpack into multiple DataFrame columns using zip(* ).

The returned tuple contains a list, which in turn contains one or more tuples.

When at least one of the nested lists contains a different number of tuples from the others, everything works fine.

In the rare cases where every nested list returned by the function contains the same number of tuples, an AssertionError: Shape of new values must be compatible with manager shape is raised.

I suspect Pandas is seeing the consistent nested list lengths and is trying to unpack the list(tuples) into separate columns.
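A minimal sketch of that suspicion, using NumPy directly rather than Pandas internals (that the conversion behaves like this inside Pandas is an assumption; the value 7 and the variable names are made up for illustration). Equal-length nested lists stack into a 3-D array, whereas an object array filled one element at a time stays 1-D with each list stored as is:

```python
import numpy as np

# Four rows, each holding a list with exactly one (x, value) tuple,
# mimicking the failing case where every nested list has the same length.
uniform_lists = [[(0, 7)] for _ in range(4)]

# NumPy interprets the equal-length nesting as extra dimensions.
stacked = np.array(uniform_lists)
print(stacked.shape)

# Filling a pre-allocated object array element by element keeps each
# list as a single cell, which is the shape a DataFrame column needs.
as_objects = np.empty(len(uniform_lists), dtype=object)
for i, item in enumerate(uniform_lists):
    as_objects[i] = item
print(as_objects.shape)
```

A 3-D result cannot fit into a single column with four rows, which would be consistent with the shape error.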

How can I force Pandas to always store the returned list as is, regardless of the conditions above?


(Python 3.7.4, Pandas 1.0.3)

Code that works:

import pandas as pd
import numpy as np

def simple_function(type_count):
    calculated_value1 = np.random.randint(5)
    calculated_value2 = np.random.randint(5)
    types_list = [tuple((x, calculated_value2)) for x in range(0, type_count)]
    return calculated_value1, types_list
    
df = pd.DataFrame([{'name': 'Joe', 'types': 1},
                   {'name': 'Beth', 'types': 1},
                   {'name': 'John', 'types': 1},
                   {'name': 'Jill', 'types': 2},
                   ], columns=['name', 'types'])

df['calculated_result'], df['types_list'] = zip(*df['types'].apply(simple_function))

Code that raises AssertionError: Shape of new values must be compatible with manager shape:

import pandas as pd
import numpy as np

def simple_function(type_count):
    calculated_value1 = np.random.randint(5)
    calculated_value2 = np.random.randint(5)
    types_list = [tuple((x, calculated_value2)) for x in range(0, type_count)]
    return calculated_value1, types_list
    
df = pd.DataFrame([{'name': 'Joe', 'types': 1},
                   {'name': 'Beth', 'types': 1},
                   {'name': 'John', 'types': 1},
                   {'name': 'Jill', 'types': 1},
                   ], columns=['name', 'types'])

df['calculated_result'], df['types_list'] = zip(*df['types'].apply(simple_function))
Rocky K
  • you can try `df[['calculated_result','types_list']] = pd.DataFrame(df['types'].apply(simple_function).tolist())` as a workaround, but there might be a better solution – Ben.T Jun 21 '20 at 01:10
  • The reason I used zip is for its [performance advantages](https://stackoverflow.com/a/61039138/7696708). `%timeit` of `zip(* )` on `simple_function` yields `553 µs ± 44.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)`, whereas your proposed solution is nearly twice as slow with `986 µs ± 8.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)`. That said, I'm grateful this solved my problem and I can return to work. If you post this as an answer, I'll mark this as answered. – Rocky K Jun 21 '20 at 02:22
  • It was a nice answer you did :) – Ben.T Jun 21 '20 at 23:10

1 Answer


You can build a DataFrame from the list of returned tuples and assign it to both columns at once:

df[['calculated_result','types_list']] = pd.DataFrame(df['types'].apply(simple_function).tolist())

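You can get a similar result with a NumPy array. Recent NumPy releases refuse to build an array from ragged nested sequences unless dtype=object is given, so it is passed explicitly here:

df['calculated_result'], df['types_list'] = np.array(df['types'].apply(simple_function).tolist(), dtype=object).T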
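If keeping zip(* ) matters for speed, one possible variation (a sketch, not benchmarked here) is to unpack as before and wrap the lists in a pd.Series before assigning, so that each list is stored as a single cell. The `simple_function` below is a deterministic stand-in for the one in the question, with the random values replaced so that the result is reproducible:

```python
import pandas as pd

def simple_function(type_count):
    # Stand-in for the question's function: same return structure
    # (a scalar plus a list of tuples), but without random values.
    types_list = [(x, type_count * 10) for x in range(type_count)]
    return type_count, types_list

# The failing case from the question: every row yields a list of equal length.
df = pd.DataFrame({'name': ['Joe', 'Beth', 'John', 'Jill'],
                   'types': [1, 1, 1, 1]})

# Unpack the tuples with zip, as in the question.
calculated, types_lists = zip(*df['types'].apply(simple_function))

df['calculated_result'] = calculated
# Building a Series from a list of lists keeps each inner list as one
# object cell, so the column stays one-dimensional.
df['types_list'] = pd.Series(list(types_lists), index=df.index)

print(df)
```

Passing `index=df.index` keeps the new Series aligned with the existing rows, which matters when the DataFrame does not use the default integer index.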
Ben.T