Repeating Correct Number of Rows in Large Pandas DataFrames

Question

I have a big dataframe whose first and last five rows look like this:

                           Name Alternative name(s)  ...     Source       family
0                            SI                  S1  ...                  NaN        alpha
1                          OmIA                Om1a  ...               P0C1R7        alpha
2                           ImI                 NaN  ...                  NaN        alpha
3                          AnIB    [An1b, AnIB-NH2]  ...               P0C1V7        alpha
4                           GIA                 G1a  ...               P01519        alpha
..                          ...                 ...  ...                  ...          ...
216                       PiXXA                 NaN  ...                  NaN        alpha
217                       MilIA                 NaN  ...                  NaN        alpha
218                      SxIIIC                 NaN  ...                  NaN           mu
219                        C1.3               Ca004  ...                  NaN        alpha
220                        C6.2               Ca065  ...                  NaN        delta

[221 rows x 46 columns]

Several of its columns hold lists, and within any given row those lists all have the same length. One such column is shown below:

0                       [α1β1γδ, α1β1γδ, α1β1γδ, α1β1γδ]
1      [α1β1δε, α2β2, α3β2, α3β4, α4β2, α6/α3β2, α7, ...
2      [α1β1δε, α1β1γδ, α2β2, α2β2, α2β4, α2β4, α3β2,...
3                               [α3β2, α7, AChBP, AChBP]
4                                                    NaN
                             ...                        
216                                       [α7, α3β2, α7]
217        [α1β1δε, α1β1γδ, α2β4, α4β2, α4β4, α7, α9α10]
218    [Nav1.1, Nav1.2, Nav1.3, Nav1.4, Nav1.5, Nav1....
219                                                  NaN
220                                                  NaN
Name: Target, Length: 221, dtype: object

I would like to repeat each row of this dataframe as many times as there are elements in its lists, with each repeated row keeping only the list element at the corresponding position. Here is the unnesting function defined in my Utilities module:

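from typing import List

import pandas as pd


def unnesting(some_df: pd.DataFrame, non_explode: List) -> pd.DataFrame:
    """This function unnests (explodes) multiple columns in a pandas DataFrame"""
    # Cast object-dtype key columns to str and left-pad them to width 100 (modifies some_df in place).
    for col_name in non_explode:
        if some_df[col_name].dtype == 'object':
            some_df[col_name] = some_df[col_name].astype(str).str.pad(width=100)
    # Move the key columns into the index, explode every remaining column, then restore them.
    return some_df.set_index(non_explode).apply(
        pd.Series.explode).reset_index()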
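
To make the desired result concrete, here is a minimal sketch on a toy frame. It uses `DataFrame.explode` with a list of columns, which assumes pandas 1.3 or later and assumes the lists in each row really do have matching lengths. The column names `Target` and `Activity` and all values are hypothetical placeholders, not taken from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two list columns whose lengths agree within each row.
toy = pd.DataFrame({
    "Name": ["SI", "GIA", "AnIB"],
    "Target": [["t1", "t2"], np.nan, ["t3", "t4", "t5"]],
    "Activity": [[1.0, 2.0], np.nan, [3.0, 4.0, 5.0]],
})

# Explode both list columns together; scalar columns such as Name are repeated.
# A NaN cell is kept as a single row.
exploded = toy.explode(["Target", "Activity"], ignore_index=True)
print(exploded)
```

Each original row appears once per list element, so the toy frame grows from 3 rows to 6.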

When I pass every column that does not contain lists as the non_explode argument, I get the following error message:

Traceback (most recent call last):
  File "c:/Users/username/Desktop/Project/script.py", line 214, in <module>
    main()
  File "c:/Users/username/Desktop/Project/script.py", line 184, in main
    exploded_train = unnesting(train_data, NONACTIVITY_COLUMNS)
  File "c:\Users\username\Desktop\Project\Utilities.py", line 431, in unnesting
    return some_df.set_index(non_explode).apply(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 8740, in apply
    return op.apply()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\apply.py", line 688, in apply
    return self.apply_standard()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\apply.py", line 815, in apply_standard
    return self.wrap_results(results, res_index)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\apply.py", line 841, in wrap_results
    return self.wrap_results_for_axis(results, res_index)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\apply.py", line 909, in wrap_results_for_axis
    result = self.obj._constructor(data=results)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 614, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 464, in dict_to_mgr
    return arrays_to_mgr(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 124, in arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 571, in _homogenize
    val = val.reindex(index, copy=False)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 4580, in reindex
    return super().reindex(index=index, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 4818, in reindex
    return self._reindex_axes(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 4834, in _reindex_axes
    new_index, indexer = ax.reindex(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\multi.py", line 2524, in reindex
    raise ValueError("cannot handle a non-unique multi-index!")
ValueError: cannot handle a non-unique multi-index!
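
One plausible reading of the traceback, stated here as an assumption rather than something the post confirms, is that the exploded columns end up with different indexes and pandas then tries to align them on a MultiIndex that contains repeated entries. That would happen if at least one row holds lists of unequal length across columns, or a list in one column and NaN in another. The sketch below is a hypothetical diagnostic helper (the function names and the demo data are made up) that lists such rows.

```python
import numpy as np
import pandas as pd


def exploded_row_count(value) -> int:
    """Number of rows a single cell turns into when it is exploded."""
    # A non-empty list becomes len(list) rows; an empty list or a scalar
    # such as NaN stays as exactly one row.
    if isinstance(value, list) and len(value) > 0:
        return len(value)
    return 1


def rows_with_mismatched_lengths(df: pd.DataFrame, list_cols: list) -> pd.DataFrame:
    """Return the per-column row counts for rows whose list columns disagree."""
    counts = df[list_cols].apply(lambda col: col.map(exploded_row_count))
    return counts[counts.nunique(axis=1) > 1]


# Made-up data: row 2 has three targets but only one activity value.
demo = pd.DataFrame({
    "Name": ["SI", "GIA", "AnIB"],
    "Target": [["a", "b"], np.nan, ["c", "d", "e"]],
    "Activity": [[1.0, 2.0], np.nan, [3.0]],
})
mismatched = rows_with_mismatched_lengths(demo, ["Target", "Activity"])
print(mismatched)
```

An empty result would rule this explanation out; any rows it returns are candidates for cleaning before exploding.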

Thank you, I did look at that link. In their case, one column remains unchanged, while in my case all columns change. – Ash, Jan 19 '22 at 00:40
Specifically, the part I don't know how to do is repeating the row indices the correct number of times, depending on the size of the list in the column. – Ash, Jan 19 '22 at 00:41
When I do `df.reindex(df.index.repeat(len(df.col2[0])))` it repeats every row 3 times, even rows whose lists have a different size. – Ash, Jan 19 '22 at 00:45
I tested the code on your data and I am getting the correct output. The only difference between the dupe and your df is that in your df you need to set the index on 2 columns: `df.set_index(['col1','col4']).apply(pd.Series.explode).reset_index()` – Vaishali, Jan 19 '22 at 00:57
I reopened the question. You can edit it to show what you tried and what error message you got. – Vaishali, Jan 19 '22 at 13:02
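
The row-repetition step discussed in the comments can be sketched by computing a repeat count per row instead of taking the length of the first list only. This is a minimal sketch under stated assumptions, not a tested fix for the real data: `repeat_and_unnest`, the column names, and the demo values are all hypothetical, and the helper raises if a row's lists disagree in length.

```python
import numpy as np
import pandas as pd


def repeat_and_unnest(df: pd.DataFrame, list_cols: list) -> pd.DataFrame:
    """Repeat each row once per list element and spread the elements out."""
    def n_items(value) -> int:
        # Scalars (including NaN) and empty lists occupy a single row.
        return max(len(value), 1) if isinstance(value, list) else 1

    # Per-row repeat count: the longest list found in that row.
    lengths = df[list_cols].apply(lambda col: col.map(n_items))
    counts = lengths.max(axis=1).to_numpy()

    # Repeat rows by position so a non-unique index cannot interfere.
    positions = np.repeat(np.arange(len(df)), counts)
    out = df.iloc[positions].reset_index(drop=True)

    for col in list_cols:
        flat = []
        for value, n in zip(df[col], counts):
            if isinstance(value, list):
                if len(value) == n:
                    flat.extend(value)
                elif len(value) == 0:
                    flat.extend([np.nan] * n)
                else:
                    raise ValueError(f"{col!r}: list of length {len(value)}, expected {n}")
            else:
                # Broadcast a scalar or NaN across the repeated rows.
                flat.extend([value] * n)
        out[col] = flat
    return out


# Made-up data with a different list length in each row.
demo = pd.DataFrame({
    "Name": ["SI", "GIA", "AnIB"],
    "Target": [["a", "b"], np.nan, ["c", "d", "e"]],
    "Activity": [[1.0, 2.0], np.nan, [3.0, 4.0, 5.0]],
})
result = repeat_and_unnest(demo, ["Target", "Activity"])
print(result)
```

Repeating by position rather than by label is a deliberate choice here: it avoids any reindexing on labels, which is the step that fails in the traceback.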
