I am working on a personal NLP/NLU project using the nps_chat corpus. My goal is to identify every question asked and then do some further analysis on them.

It is a rather large data set, structured as follows:

Data columns (total 4 columns):
 #   Column               Dtype 
---  ------               ----- 
 0   episode              int64 
 1   episode_order        int64 
 2   speaker              object
 3   utterance            object
dtypes: int64(2), object(2)

Each episode consists of a series of utterances by speakers, ordered by the episode_order column.

I've sentence-tokenized each utterance and identified any questions it contains. These questions are stored as a list in a fifth column called 'questions'. Most rows hold an empty list [], while the rest hold anywhere from one question to several questions asked in a row.

What I am trying to solve: I'd like to lengthen the data frame wherever an utterance contains multiple questions. For each row whose list holds more than one question, I'd like to:

  1. leave only the first question asked in the original row
  2. add rows directly below the original row, one for each remaining question in the list. Each new row copies every column of the original row, except that the 'questions' column holds the next question.

Here is what I am trying to achieve (credit to the user who answered below for the example):

import pandas as pd
df = pd.DataFrame(
     {
        "episodes" : [1, 2], 
        "utterance": ["hey", "ho"],
        "questions": [['Where?', "Who?"], ["What?", "When?"]]
     }
)

df
>>>
    episodes    utterance   questions
0   1           hey         [Where?, Who?]
1   2           ho          [What?, When?]

Desired output:

    episodes    utterance   questions
0   1           hey         Where?
0   1           hey         Who?
1   2           ho          What?
1   2           ho          When?

What is the best approach for this? I am trying to think through an apply/lambda solution. I've also thought about stepping through the data frame one episode at a time: carve out a whole episode, pass it to a function that lengthens it as described, and append the result to a new data frame. There are 3M rows in this data set, so that could take a while.

Any advice is appreciated. Thanks!

  • 1
    Your question is unclear. Please [edit] to provide a [mcve] including sample input, expected output, and code for what you've tried so far, so that we can better understand how to help. Please see [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – G. Anderson Jun 04 '21 at 21:24

1 Answer

Perhaps this is what you are looking for?

import pandas as pd
df = pd.DataFrame(
    {
        "episodes" : [1, 2], 
        "utterance": ["hey", "ho"],
        "questions": [['Where?', "Who?"], ["What?", "When?"]]
    }
)

df
>>>
    episodes    utterance   questions
0   1           hey         [Where?, Who?]
1   2           ho          [What?, When?]


df.explode('questions')
>>>
    episodes    utterance   questions
0   1           hey         Where?
0   1           hey         Who?
1   2           ho          What?
1   2           ho          When?
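
Since the question mentions that most rows hold an empty list, here is a minimal sketch of how explode treats those rows. The three-row frame and the names long_df and only_questions are made up for illustration. As far as I know, a row with an empty list is kept as a single row with NaN in 'questions', and the original index labels are repeated unless you reset them.

```python
import pandas as pd

# Toy frame modelled on the question (hypothetical data):
# the middle row has no questions at all.
df = pd.DataFrame(
    {
        "episodes": [1, 1, 2],
        "utterance": ["hey", "ok", "ho"],
        "questions": [["Where?", "Who?"], [], ["What?"]],
    }
)

# One row per list element; the other columns are copied as-is.
# reset_index(drop=True) replaces the repeated index labels with 0..n-1.
long_df = df.explode("questions").reset_index(drop=True)

# Optional: keep only the rows that actually contain a question.
only_questions = long_df.dropna(subset=["questions"])

print(long_df)
print(only_questions)
```

This stays in a single vectorized call, so it should avoid the per-episode loop on the 3M-row frame, although I have not timed it on the actual corpus.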
nocibambi
  • 1
    Yes, thank you for both making my ask more clear for others and answering at the same time. I appreciate it and will try this. – Josh Willis Jun 04 '21 at 21:34