I have a list of strings representing a conversation. I am trying to convert it into a dataframe with an index, a column for the utterance (the text itself), and a column for the speaker label.

myconvo = ['Speaker1: this is one utterance', 
            'Speaker2: this is another utterance', 
            'Speaker1: this is a third utterance']

I assume that I will need to transform the list of strings into a list of lists, where each sub-list will comprise the speaker ID and the utterance.
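For illustration, here is a minimal sketch of that list-of-lists idea without a regular expression. It assumes every string contains a ": " separator after the speaker ID, and the column names speaker and utterance are placeholders chosen for this example, not anything from my real data. The default integer index of the dataframe would serve as the index column.

```python
import pandas as pd

myconvo = ['Speaker1: this is one utterance',
           'Speaker2: this is another utterance',
           'Speaker1: this is a third utterance']

# Split each string once, at the first ": ", into [speaker, utterance].
# maxsplit=1 keeps any later colons inside the utterance intact.
rows = [line.split(': ', 1) for line in myconvo]

# Column names are illustrative assumptions.
df = pd.DataFrame(rows, columns=['speaker', 'utterance'])
print(df)
```

A line without ": " would produce a one-element sub-list here, so this only holds if the input is as regular as the sample above.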

So far I have used the regular expression below, but it returns an extra empty string at the start of each result.

import re
for i in myconvo:
    a = re.split(r'(Speaker\d)', i, flags=re.MULTILINE)
    print(a)

['', 'Speaker1', ': this is one utterance']
['', 'Speaker2', ': this is another utterance']
['', 'Speaker1', ': this is a third utterance']

Worst case, I could just drop that first element from each list, but I suspect there are things in my approach that could be improved.
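As far as I understand it, the empty string appears because re.split also returns the text before the first match, and since each string starts with the speaker ID, that text is empty. One possible alternative, sketched below, is to match the whole line with two capture groups instead of splitting. The pattern is an assumption about the format (a "Speaker" prefix followed by digits, then a colon), and lines that do not match are simply skipped.

```python
import re

myconvo = ['Speaker1: this is one utterance',
           'Speaker2: this is another utterance',
           'Speaker1: this is a third utterance']

# Group 1 captures the speaker ID, group 2 the utterance after the colon.
# \d+ allows multi-digit IDs such as Speaker12 (an assumption about the data).
pattern = re.compile(r'(Speaker\d+):\s*(.*)')

rows = []
for line in myconvo:
    m = pattern.match(line)
    if m:  # skip lines that do not follow the assumed format
        rows.append(list(m.groups()))

print(rows)
```

Each sub-list then holds exactly the speaker ID and the utterance, with no leading empty string and no stray colon.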

cookie1986