Elongating a Data Frame in Pandas

Question

I am working on a personal NLP/NLU project using the nps_chat corpus. I want to identify all the questions asked and then do some further analysis.

It is a rather large data set and is formatted as such:

Data columns (total 4 columns):
 #   Column               Dtype 
---  ------               ----- 
 0   episode              int64 
 1   episode_order        int64 
 2   speaker              object
 3   utterance            object
dtypes: int64(2), object(2)

Each episode contains a series of utterances by speakers, ordered by the episode_order column.

I've sentence-tokenized each utterance and identified any questions in it. These questions are stored as a list in a 5th column called 'questions'. Most rows have an empty list [], while others hold anywhere from one question to several questions asked in series.

What I am trying to solve: I'd like to elongate the data frame wherever an utterance contains multiple questions. For each row that contains more than one question, I'd like to:

leave only the first question asked in the original row
add additional rows below the original row each containing one of the remaining questions in the list. The row is a copy of all columns in the original row except the 'questions' column contains the next question.

Credit to the user who answered below for this example. Here is what I am trying to achieve:

import pandas as pd
df = pd.DataFrame(
     {
        "episodes" : [1, 2], 
        "utterance": ["hey", "ho"],
        "questions": [['Where?', "Who?"], ["What?", "When?"]]
     }
)

df
>>>
    episodes    utterance   questions
0   1           hey         [Where?, Who?]
1   2           ho          [What?, When?]

Desired result:

    episodes    utterance   questions
0   1           hey         Where?
0   1           hey         Who?
1   2           ho          What?
1   2           ho          When?

What is the best approach for this? I am trying to think through an apply/lambda solution. I've also thought about going through the data frame one episode at a time: carve out the episode, pass it into a function, elongate it as described, return it, and append it to a new data frame. There are 3M rows in this data set, so that could take a while.

Any advice is appreciated. Thanks!

Your question is unclear. Please [edit] to provide a [mcve] including sample input, expected output, and code for what you've tried so far, so that we can better understand how to help. Please see [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — G. Anderson, Jun 04 '21 at 21:24

score 0 · Accepted Answer · answered Jun 04 '21 at 21:30

Perhaps this is what you are looking for?

import pandas as pd
df = pd.DataFrame(
    {
        "episodes" : [1, 2], 
        "utterance": ["hey", "ho"],
        "questions": [['Where?', "Who?"], ["What?", "When?"]]
    }
)

df
>>>
    episodes    utterance   questions
0   1           hey         [Where?, Who?]
1   2           ho          [What?, When?]


df.explode('questions')
>>>
    episodes    utterance   questions
0   1           hey         Where?
0   1           hey         Who?
1   2           ho          What?
1   2           ho          When?
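
Since the question mentions that most rows hold an empty list, here is a minimal sketch of how explode treats those rows. The three-row sample and the names long_df and questions_only are made up for illustration, and ignore_index assumes pandas 1.1 or newer.

```python
import pandas as pd

# Hypothetical sample: the middle row has no questions (an empty list).
df = pd.DataFrame(
    {
        "episodes": [1, 2, 3],
        "utterance": ["hey", "hm", "ho"],
        "questions": [["Where?", "Who?"], [], ["What?", "When?"]],
    }
)

# ignore_index=True (pandas >= 1.1) renumbers the rows 0..n-1
# instead of repeating the original index label for each question.
long_df = df.explode("questions", ignore_index=True)

# A row with an empty list is kept as a single row whose 'questions'
# value is NaN, so no utterance is dropped by explode itself.
# If only the rows that contain a question are needed, drop the NaN rows.
questions_only = long_df.dropna(subset=["questions"]).reset_index(drop=True)
```

If the repeated index from the original row is useful (for example, to trace each question back to its source utterance), leave ignore_index at its default of False.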

Yes, thank you for both making my ask more clear for others and answering at the same time. I appreciate it and will try this. — Josh Willis, Jun 04 '21 at 21:34
