I would like to convert a DataFrame column (or Series) whose cells contain lists of different lengths into a DataFrame with a fixed number of columns.
The resulting DataFrame should have as many columns as the longest list; where a list is shorter, the remaining values can be NaN or anything else.
The str accessor allows this when the data comes as a string, via the expand
option of str.split
. But I have not been able to find an equivalent for lists of variable length.
In my example the lists contain ints, but the idea is to be able to do it with any type. This rules out simply converting the Series to str and using the expand
option mentioned above.
Below I show code that runs the example on the string representation using str.split
, followed by a minimal example with the actual Series to be converted.
I found a solution using apply, shown in the example, but it is so slow that it is not useful.
import numpy as np
import pandas as pd
# Example with a list as a string
A = pd.DataFrame({
    'lists': [
        '[]',
        '[360,460,160]',
        '[360,1,2,3,4,5,6]',
        '[10,20,30]',
        '[100,100,100,100]',
    ],
    'other': [1, 2, 3, 4, 5],
})
print(A['lists'].astype(str).str.strip('[]').str.split(',', expand=True))
# Example with actual lists
B = pd.DataFrame({
    'lists': [
        [],
        [360, 460, 160],
        [360, 1, 2, 3, 4, 5, 6],
        [10, 20, 30],
        [100, 100, 100, 100],
    ],
    'other': [1, 2, 3, 4, 5],
})
# Create and pre-fill the expected columns
max_len = max(B['lists'].str.len())
for idx in range(max_len):
    B[f'lists_{idx}'] = np.nan

# Use .apply to fill the columns (works, but very slow)
def expand_int_list(row, col, df):
    for idx, item in enumerate(row[col]):
        df.loc[row.name, f'{col}_{idx}'] = item

B.apply(lambda row: expand_int_list(row, 'lists', B), axis=1)
print(B)
Output:
     0     1     2     3     4     5     6
0        None  None  None  None  None  None
1  360   460   160  None  None  None  None
2  360     1     2     3     4     5     6
3   10    20    30  None  None  None  None
4  100   100   100   100  None  None  None
                     lists  other  lists_0  lists_1  lists_2  lists_3  \
0                       []      1      NaN      NaN      NaN      NaN
1          [360, 460, 160]      2    360.0    460.0    160.0      NaN
2  [360, 1, 2, 3, 4, 5, 6]      3    360.0      1.0      2.0      3.0
3             [10, 20, 30]      4     10.0     20.0     30.0      NaN
4     [100, 100, 100, 100]      5    100.0    100.0    100.0    100.0

   lists_4  lists_5  lists_6
0      NaN      NaN      NaN
1      NaN      NaN      NaN
2      4.0      5.0      6.0
3      NaN      NaN      NaN
4      NaN      NaN      NaN
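For comparison, when every cell in the column really is a list (no None entries), the expansion can be done in a single vectorized step by passing the lists straight to the DataFrame constructor, which NaN-pads the shorter rows. This is a minimal sketch of that approach; as noted in the edit below, it breaks when some cells hold None instead of a list:

```python
import pandas as pd

B = pd.DataFrame({
    'lists': [
        [],
        [360, 460, 160],
        [360, 1, 2, 3, 4, 5, 6],
        [10, 20, 30],
        [100, 100, 100, 100],
    ],
    'other': [1, 2, 3, 4, 5],
})

# Build the expanded frame directly from the list of lists;
# shorter rows are padded with NaN, and the index is preserved.
expanded = pd.DataFrame(B['lists'].tolist(), index=B.index).add_prefix('lists_')
B = pd.concat([B, expanded], axis=1)
```

This avoids the per-row .loc writes of the apply version, which is where the slowness comes from.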
EDIT AND FINAL SOLUTION:
An important piece of information that made the methods found in other questions fail is that my data sometimes contains None
instead of a list.
In that situation, tolist()
still includes those None entries, so building the expanded DataFrame from it fails, and Pandas will not allow replacing those cells with an empty list via B.loc[B[col].isna(), col] = []
.
The solution I found is to apply tolist()
only to the rows that are not None, and concat
the result using the original index:
# Example with actual lists
B = pd.DataFrame({
    'lists': [
        [],
        [360, 460, 160],
        None,
        [10, 20, 30],
        [100, 100, 100, 100],
    ],
    'other': [1, 2, 3, 4, 5],
})
col = 'lists'
# I need to keep the index for the concat afterwards.
extended = pd.DataFrame(B.loc[~B[col].isna(), col].tolist(),
index=B.loc[~B[col].isna()].index)
extended = extended.add_prefix(f'{col}_')
B = pd.concat([B, extended], axis=1)
print(B)
Output:
                  lists  other  lists_0  lists_1  lists_2  lists_3
0                    []      1      NaN      NaN      NaN      NaN
1       [360, 460, 160]      2    360.0    460.0    160.0      NaN
2                  None      3      NaN      NaN      NaN      NaN
3          [10, 20, 30]      4     10.0     20.0     30.0      NaN
4  [100, 100, 100, 100]      5    100.0    100.0    100.0    100.0
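The same technique can be written a little more compactly with dropna(), which returns both the non-None values and their original index in one step, so the ~B[col].isna() mask is not needed twice. A sketch of that variant:

```python
import pandas as pd

B = pd.DataFrame({
    'lists': [
        [],
        [360, 460, 160],
        None,
        [10, 20, 30],
        [100, 100, 100, 100],
    ],
    'other': [1, 2, 3, 4, 5],
})

col = 'lists'
# dropna() removes the None cells but keeps the original index labels,
# so concat can realign the expanded rows afterwards.
valid = B[col].dropna()
extended = pd.DataFrame(valid.tolist(), index=valid.index).add_prefix(f'{col}_')
B = pd.concat([B, extended], axis=1)
```

Rows that held None simply get NaN in all the new columns after the concat, exactly as in the output above.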