
I would like to convert a DataFrame column (or Series) with lists that have different lengths into a DataFrame with a fixed number of columns.

The resulting DataFrame should have as many columns as the longest list; positions where a list is shorter can be NaN or any placeholder.

The .str accessor allows for this when the data comes as a string, via the expand option of str.split. But I have not been able to find an equivalent for lists of variable length.

In my example the type in the lists is int, but the idea is to be able to do it with any type. This rules out simply converting the Series to str and using the expand option mentioned above.

Below I show code to run the example with the list as a string using the str.split function, followed by a minimal example with the Series to be converted.

I found a solution using apply, shown in the example, but it is so slow that it is not useful.

import numpy as np
import pandas as pd

# Example with a list as a string
A = pd.DataFrame({'lists': [
                    '[]',
                    '[360,460,160]',
                    '[360,1,2,3,4,5,6]',
                    '[10,20,30]',
                    '[100,100,100,100]',
                    ],
                  'other': [1,2,3,4,5]
                 })
print(A['lists'].astype(str).str.strip('[]').str.split(',', expand=True))

# Example with actual lists
B = pd.DataFrame({'lists': [
                    [],
                    [360,460,160],
                    [360,1,2,3,4,5,6],
                    [10,20,30],
                    [100,100,100,100],
                ],
                  'other': [1,2,3,4,5]
                 })

# Create and pre-fill expected columns
max_len = max(B['lists'].str.len())
for idx in range(max_len):
    B[f'lists_{idx}'] = np.nan

# Use .apply to fill the columns (mutates B in place; very slow on large data)
def expand_int_list(row, col, df):
    for idx, item in enumerate(row[col]):
        df.loc[row.name, f'{col}_{idx}'] = item

B.apply(lambda row: expand_int_list(row, 'lists', B), axis=1)
print(B)

Output (first from the str.split example, then the apply-based expansion):

     0     1     2     3     4     5     6
0       None  None  None  None  None  None
1  360   460   160  None  None  None  None
2  360     1     2     3     4     5     6
3   10    20    30  None  None  None  None
4  100   100   100   100  None  None  None
                     lists  other  lists_0  lists_1  lists_2  lists_3  \
0                       []      1      NaN      NaN      NaN      NaN   
1          [360, 460, 160]      2    360.0    460.0    160.0      NaN   
2  [360, 1, 2, 3, 4, 5, 6]      3    360.0      1.0      2.0      3.0   
3             [10, 20, 30]      4     10.0     20.0     30.0      NaN   
4     [100, 100, 100, 100]      5    100.0    100.0    100.0    100.0   

   lists_4  lists_5  lists_6  
0      NaN      NaN      NaN  
1      NaN      NaN      NaN  
2      4.0      5.0      6.0  
3      NaN      NaN      NaN  
4      NaN      NaN      NaN  

EDIT AND FINAL SOLUTION: An important detail that made the methods from other questions fail is that my data sometimes contains None instead of a list.

In that situation, passing the output of tolist() to the DataFrame constructor does not expand the column correctly, and Pandas does not allow replacing those cells with an empty list via B.loc[B[col].isna(), col] = [].

The solution I found is to apply tolist() only to the rows that are not None, then concat using the original index:

# Example with actual lists
B = pd.DataFrame({'lists': [
                    [],
                    [360,460,160],
                    None,
                    [10,20,30],
                    [100,100,100,100],
                ],
                  'other': [1,2,3,4,5]
                 })

col = 'lists'
# I need to keep the index for the concat afterwards.
extended = pd.DataFrame(B.loc[~B[col].isna(), col].tolist(),
                        index=B.loc[~B[col].isna()].index)
extended = extended.add_prefix(f'{col}_')
B = pd.concat([B, extended], axis=1)

print(B)

Output:

                  lists  other  lists_0  lists_1  lists_2  lists_3
0                    []      1      NaN      NaN      NaN      NaN
1       [360, 460, 160]      2    360.0    460.0    160.0      NaN
2                  None      3      NaN      NaN      NaN      NaN
3          [10, 20, 30]      4     10.0     20.0     30.0      NaN
4  [100, 100, 100, 100]      5    100.0    100.0    100.0    100.0
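
An alternative sketch (not part of the original answer, just an idea under the same assumptions): instead of masking out the None rows, normalize the column first so every cell is a list, after which tolist() works on the whole column at once. The isinstance check treats any non-list cell (None, NaN, etc.) as an empty list:

```python
import numpy as np
import pandas as pd

B = pd.DataFrame({'lists': [
                    [],
                    [360, 460, 160],
                    None,
                    [10, 20, 30],
                    [100, 100, 100, 100],
                ],
                  'other': [1, 2, 3, 4, 5]
                 })

col = 'lists'
# Normalize: treat any non-list cell (e.g. None) as an empty list.
filled = B[col].apply(lambda x: x if isinstance(x, list) else [])

# Now tolist() works on the whole column; keep the index for alignment.
extended = pd.DataFrame(filled.tolist(), index=B.index).add_prefix(f'{col}_')
B = pd.concat([B, extended], axis=1)
print(B)
```

This produces the same output as the solution above; the trade-off is one extra apply pass over the column versus the boolean-mask indexing.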
  • Does this answer your question? [How to create a Pandas DataFrame from a list of lists with different lengths?](https://stackoverflow.com/questions/59783356/how-to-create-a-pandas-dataframe-from-a-list-of-lists-with-different-lengths) – Serial Lazer Nov 07 '20 at 07:04
  • Thank you @MayankPorwal for your suggestion, I thought it did not work with my case due to the variable length, but the reason was having None instead of a list sometimes. I'll update the question. – rpicatoste Nov 07 '20 at 08:45

1 Answer


If you convert the nested lists to a list of lists and pass it to the DataFrame constructor, missing values are filled in up to the length of the longest list. Then rename the columns with DataFrame.add_prefix and append to the original with DataFrame.join:

df = B.join(pd.DataFrame(B['lists'].tolist()).add_prefix('lists_'))
print(df)
                     lists  other  lists_0  lists_1  lists_2  lists_3  \
0                       []      1      NaN      NaN      NaN      NaN   
1          [360, 460, 160]      2    360.0    460.0    160.0      NaN   
2  [360, 1, 2, 3, 4, 5, 6]      3    360.0      1.0      2.0      3.0   
3             [10, 20, 30]      4     10.0     20.0     30.0      NaN   
4     [100, 100, 100, 100]      5    100.0    100.0    100.0    100.0   

   lists_4  lists_5  lists_6  
0      NaN      NaN      NaN  
1      NaN      NaN      NaN  
2      4.0      5.0      6.0  
3      NaN      NaN      NaN  
4      NaN      NaN      NaN  
  • Thank you @jezrael for the reply. I tried .tolist() before and thought that it was not working due to the variable length lists, but it was that I have None instead of a list sometimes. I will update the question and add also the solution found, which starts from your reply. – rpicatoste Nov 07 '20 at 08:48