0

I have data in a pandas dataframe and I'm trying to separate and extract data out of a specific column col. The values in col are all lists of various sizes that store 4-value tuples (previous 4 key-value dictionaries). These values are always in the same relative order for the tuple.

For each of those tuples, I'd like to have a separate row in the final dataframe as well as having the respective value from the tuple stored in a new column.

The DataFrame df looks like this:

ID    col
A     [(123, 456, 111, False), (124, 456, 111, true), (125, 456, 111, False)]
B     []
C     [(123, 555, 333, True)]

I need to split col into four columns but also lengthen the dataframe for each record so each tuple has its own row in df2. DataFrame d2 should look like this:

ID   col1  col2  col3  col4
A    123   456   111   False
A    124   456   111   True
A    125   456   111   False
B    None  None  None  None
C    123   555   333   True

I have some sort of workaround loop-based code that seems to get the job done but I'd like to find a better and more efficient way that I can run on a huge data set. Perhaps using vectorization or NumPy if possible. Here's what I have so far:

import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C'], 
                   'col': [[('123', '456', '111', False),
                            ('124', '456', '111', True),
                            ('125', '456', '111', False)],
                           [],
                           [('123', '555', '333', True)]]
                   })
final_rows = []

for index, row in df.iterrows():
    if not row.col:   # if list is empty
        final_rows.append(row.ID)
    for tup in row.col:
        new_row = [row.ID]
        vals = list(tup)
        new_row.extend(vals)
        final_rows.append(new_row)

df2 = pd.DataFrame(final_rows, columns=['ID', 'col1', 'col2', 'col3', 'col4'])
fastlanes
  • 323
  • 1
  • 3
  • 14

2 Answers2

4

Here is another solution, you can try out using explode + concat

df_ = df.explode('col').reset_index(drop=True)

pd.concat(
    [df_[['ID']], pd.DataFrame(df_['col'].tolist()).add_prefix('col')], axis=1
)

  ID col0  col1  col2   col3
0  A  123   456   111  False
1  A  124   456   111   True
2  A  125   456   111  False
3  B  NaN  None  None   None
4  C  123   555   333   True
sushanth
  • 8,275
  • 3
  • 17
  • 28
  • This works but is there a way to do it without using `explode`? Unfortunately, I'm not able to upgrade pandas in the target environment to the version that supports it. – fastlanes Apr 30 '21 at 13:30
  • Here are bunch of alternates, you might consider looking at https://stackoverflow.com/a/53218939/4985099 – sushanth Apr 30 '21 at 13:35
1

Try explode followed by apply ( pd.Series ) then merge back to the DataFrame:

import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C'],
                   'col': [[('123', '456', '111', False),
                            ('124', '456', '111', True),
                            ('125', '456', '111', False)],
                           [],
                           [('123', '555', '333', True)]]
                   })
# Explode into Rows
new_df = df.explode('col').reset_index(drop=True)  

# Merge Back Together
new_df = new_df.merge(
    # Turn into Multiple Columns
    new_df['col'].apply(pd.Series),
    left_index=True,
    right_index=True) \
    .drop(columns=['col'])  # Drop Old Col Column

# Rename Columns
new_df.columns = ['ID', 'col1', 'col2', 'col3', 'col4']

# For Display
print(new_df)

Output:

  ID col1 col2 col3   col4
0  A  123  456  111  False
1  A  124  456  111   True
2  A  125  456  111  False
3  B  NaN  NaN  NaN    NaN
4  C  123  555  333   True
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
  • This works but is there a way to do it without using `explode`? Unfortunately, I'm not able to upgradepan das in the target environment to the version that supports it. – fastlanes Apr 30 '21 at 13:31