I have a large dataframe whose first and last five rows look like this:
       Name  Alternative name(s)  ...  Source  family
0        SI                   S1  ...     NaN   alpha
1      OmIA                 Om1a  ...  P0C1R7   alpha
2       ImI                  NaN  ...     NaN   alpha
3      AnIB     [An1b, AnIB-NH2]  ...  P0C1V7   alpha
4       GIA                  G1a  ...  P01519   alpha
..      ...                  ...  ...     ...     ...
216   PiXXA                  NaN  ...     NaN   alpha
217   MilIA                  NaN  ...     NaN   alpha
218  SxIIIC                  NaN  ...     NaN      mu
219    C1.3                Ca004  ...     NaN   alpha
220    C6.2                Ca065  ...     NaN   delta

[221 rows x 46 columns]
Specifically, the dataframe has several columns whose cells hold lists, and within any given row those lists all have the same length. One such column is shown below:
0      [α1β1γδ, α1β1γδ, α1β1γδ, α1β1γδ]
1      [α1β1δε, α2β2, α3β2, α3β4, α4β2, α6/α3β2, α7, ...
2      [α1β1δε, α1β1γδ, α2β2, α2β2, α2β4, α2β4, α3β2,...
3      [α3β2, α7, AChBP, AChBP]
4      NaN
       ...
216    [α7, α3β2, α7]
217    [α1β1δε, α1β1γδ, α2β4, α4β2, α4β4, α7, α9α10]
218    [Nav1.1, Nav1.2, Nav1.3, Nav1.4, Nav1.5, Nav1....
219    NaN
220    NaN
Name: Target, Length: 221, dtype: object
I would like to repeat each row of this dataframe once per element of its lists, with each repeated row keeping only the list element at the corresponding position. Here is the unnesting function defined in my Utilities module:
def unnesting(some_df: pd.DataFrame, non_explode: List) -> pd.DataFrame:
    """This function unnests (explodes) multiple columns in a pandas DataFrame"""
    for col_name in non_explode:
        if some_df[col_name].dtype == 'object':
            some_df[col_name] = some_df[col_name].astype(str).str.pad(width=100)
    return some_df.set_index(non_explode).apply(
        pd.Series.explode).reset_index()
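To make the goal concrete, here is a minimal sketch of the behaviour I am after, written as a plain loop on a toy frame. The helper name `expand_rows`, the column names, and the values are all made up for illustration, and the sketch assumes the lists within a row have equal length:

```python
import pandas as pd


def expand_rows(df, list_cols):
    """Hypothetical helper: emit one output row per list position."""
    records = []
    for _, row in df.iterrows():
        first = row[list_cols[0]]
        # Non-list cells (e.g. NaN) yield a single, unchanged row.
        n = len(first) if isinstance(first, list) else 1
        for i in range(n):
            rec = row.to_dict()
            for col in list_cols:
                cell = row[col]
                if isinstance(cell, list):
                    # Keep only the element at position i.
                    rec[col] = cell[i]
            records.append(rec)
    return pd.DataFrame(records, columns=df.columns)


# Toy data, not taken from the real dataframe.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Target": [["t1", "t2"], ["t3"], float("nan")],
    "Value": [[1.0, 2.0], [3.0], float("nan")],
})
result = expand_rows(toy, ["Target", "Value"])
print(result)
```

Row "A" should appear twice (once per list element), while "B" and "C" should each appear once.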
When I pass all columns that do not contain lists as the non_explode
argument of the function, I get the following error message:
Traceback (most recent call last):
  File "c:/Users/username/Desktop/Project/script.py", line 214, in <module>
    main()
  File "c:/Users/username/Desktop/Project/script.py", line 184, in main
    exploded_train = unnesting(train_data, NONACTIVITY_COLUMNS)
  File "c:\Users\username\Desktop\Project\Utilities.py", line 431, in unnesting
    return some_df.set_index(non_explode).apply(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 8740, in apply
    return op.apply()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\apply.py", line 688, in apply
    return self.apply_standard()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\apply.py", line 815, in apply_standard
    return self.wrap_results(results, res_index)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\apply.py", line 841, in wrap_results
    return self.wrap_results_for_axis(results, res_index)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\apply.py", line 909, in wrap_results_for_axis
    result = self.obj._constructor(data=results)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 614, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 464, in dict_to_mgr
    return arrays_to_mgr(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 124, in arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 571, in _homogenize
    val = val.reindex(index, copy=False)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 4580, in reindex
    return super().reindex(index=index, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 4818, in reindex
    return self._reindex_axes(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 4834, in _reindex_axes
    new_index, indexer = ax.reindex(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\multi.py", line 2524, in reindex
    raise ValueError("cannot handle a non-unique multi-index!")
ValueError: cannot handle a non-unique multi-index!
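The traceback ends in a `reindex` call made while the exploded columns are being reassembled into one frame. One thing that may be worth checking, on the assumption that this step is reached because the exploded columns come out with different lengths, is whether every row really has lists of equal length across the list columns. The sketch below is only a diagnostic under that assumption; the helper names and the toy data are hypothetical:

```python
import pandas as pd


def exploded_len(cell):
    # Series.explode emits one row for scalars, NaN and empty lists,
    # and one row per element for non-empty lists.
    if isinstance(cell, (list, tuple)) and len(cell) > 0:
        return len(cell)
    return 1


def rows_with_mismatched_lengths(df, list_cols):
    """Hypothetical check: index labels whose list columns disagree in length."""
    lengths = df[list_cols].apply(lambda col: col.map(exploded_len))
    # More than one distinct length in a row means the columns disagree.
    return lengths.index[lengths.nunique(axis=1) > 1].tolist()


# Toy data, not taken from the real dataframe.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Target": [["t1", "t2"], ["t3"], float("nan")],
    "Value": [[1.0], [3.0], [5.0, 6.0]],
})
bad = rows_with_mismatched_lengths(toy, ["Target", "Value"])
print(bad)
```

Here rows 0 and 2 are reported: row 0 has lists of length 2 and 1, and row 2 pairs a NaN with a two-element list. An empty result on the real data would suggest the cause lies elsewhere.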