I have a list of strings representing a conversation. I want to convert it into a dataframe with an index, a column for the utterance text, and a column for the speaker label.
myconvo = ['Speaker1: this is one utterance',
'Speaker2: this is another utterance',
'Speaker1: this is a third utterance']
I assume that I will need to transform the list of strings into a list of lists, where each sub-list will comprise the speaker ID and the utterance.
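To make the target structure concrete, here is a minimal sketch of the list of lists described above. It assumes every string contains the speaker label followed by `': '`; the variable name `rows` is only illustrative.

```python
myconvo = ['Speaker1: this is one utterance',
           'Speaker2: this is another utterance',
           'Speaker1: this is a third utterance']

# Split each string once, at the first ': ', into [speaker, utterance].
# maxsplit=1 keeps any later colons inside the utterance text intact.
rows = [line.split(': ', 1) for line in myconvo]

for row in rows:
    print(row)
```

Each sub-list then holds two items, the speaker ID and the utterance, which is the shape a two-column dataframe would need.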
So far I have used the regular expression below, but it returns an extra empty string at the start of each result.
import re
for i in myconvo:
    a = re.split(r'(Speaker\d)', i, flags=re.MULTILINE)
    print(a)
['', 'Speaker1', ': this is one utterance']
['', 'Speaker2', ': this is another utterance']
['', 'Speaker1', ': this is a third utterance']
Worst case, I could just drop that first element from each result, but I suspect there are things in my approach that could be improved.
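For reference, the empty string appears because `re.split` also returns the text before the first match, and here the match starts at position 0, so that text is `''`. One possible alternative, sketched below, is to match the whole line with two capture groups instead of splitting. The pattern, the `rows` variable, and the column names `speaker` and `utterance` are assumptions made for illustration, and the sketch assumes every line starts with a label of the form `SpeakerN:`.

```python
import re

import pandas as pd

myconvo = ['Speaker1: this is one utterance',
           'Speaker2: this is another utterance',
           'Speaker1: this is a third utterance']

# Assumed pattern: speaker label, a colon, optional whitespace, then the rest of the line.
pattern = re.compile(r'(Speaker\d+):\s*(.*)')

rows = []
for line in myconvo:
    m = pattern.match(line)
    if m:  # lines that do not fit the assumed pattern are skipped
        rows.append([m.group(1), m.group(2)])

# The default integer index of the DataFrame serves as the index column.
df = pd.DataFrame(rows, columns=['speaker', 'utterance'])
print(df)
```

Matching with groups avoids both the leading empty string and the leftover `': '` at the start of each utterance, so nothing needs to be deleted afterwards.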