Hi all. I am working on a personal NLP/NLU project using the nps_chat corpus. My goal is to identify all the questions asked and then do some further analysis.
It is a rather large data set, formatted as follows:
Data columns (total 4 columns):
 #   Column         Dtype
---  ------         -----
 0   episode        int64
 1   episode_order  int64
 2   speaker        object
 3   utterance      object
dtypes: int64(2), object(2)
For each episode, there is a series of utterances by speakers, ordered by the episode_order column.
I've sentence-tokenized each utterance and identified any questions in it. These questions are stored as a list in a fifth column called 'questions'. Most rows have an empty list [], while others hold anywhere from one question to several questions asked in series.
What I am trying to solve: I'd like to lengthen the data frame wherever an utterance contains multiple questions. For each row that contains more than one question, I'd like to:
- leave only the first question asked in the original row
- add rows below the original row, one for each of the remaining questions in the list. Each new row is a copy of all columns in the original row, except that the 'questions' column contains the next question.
--Credit to the user below who answered-- Here is what I am trying to achieve.
import pandas as pd

df = pd.DataFrame(
    {
        "episodes": [1, 2],
        "utterance": ["hey", "ho"],
        "questions": [["Where?", "Who?"], ["What?", "When?"]],
    }
)
df
>>>
   episodes utterance       questions
0         1       hey  [Where?, Who?]
1         2        ho  [What?, When?]
Desired result:
   episodes utterance questions
0         1       hey    Where?
0         1       hey      Who?
1         2        ho     What?
1         2        ho     When?
What is the best approach for this? I am trying to think through an apply/lambda solution. I've also thought about going through the data frame one episode at a time: carve out the whole episode, pass it into a function, lengthen it as described, return it, and then append it to a new data frame. There are 3M rows in this data set, so that could take a while.
Any advice is appreciated. Thanks!
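For reference, the transformation described above looks like what pandas' DataFrame.explode does, so here is a minimal sketch on the toy frame. It assumes the 'questions' column holds real Python lists (not strings that look like lists), and the third row with an empty list is added here only to show how rows without questions would behave.

```python
import pandas as pd

# Toy frame from the question, plus one assumed row with an empty list.
df = pd.DataFrame(
    {
        "episodes": [1, 2, 3],
        "utterance": ["hey", "ho", "ok"],
        "questions": [["Where?", "Who?"], ["What?", "When?"], []],
    }
)

# explode() emits one row per list element and copies the other columns.
# The original index label is repeated for rows that came from the same list.
long_df = df.explode("questions")

# An empty list becomes a single row holding NaN, so utterances without
# questions are kept rather than dropped.
print(long_df)

# Optional: renumber the rows 0..n-1 instead of keeping repeated labels.
long_df_reset = long_df.reset_index(drop=True)
```

Because explode works on the whole column at once, it should avoid the per-episode loop and append pattern, which tends to be slow on millions of rows. If the NaN placeholders are not wanted downstream, they could be filtered out or replaced afterwards.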