disassemble and reassemble strings based on list

Question

I have four speakers like this:

Team_A=[Fred,Bob]

Team_B=[John,Jake]

They are having a conversation, and it is all represented by a single string, i.e. convo =

Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine

How do I disassemble and reassemble the string so that I get 2 strings: 1 string of what Team_A said, and 1 string of what Team_B said?

output: team_A_said="hello how is it going?", team_B_said="hi we are doing fine"

The lines don't matter.

I have this awful find... then slice code that is not scalable. Can someone suggest something else? Any libraries to help with this?

I didn't find anything in the nltk library.

Does the `convo` string always consist of blocks of the form `name\nstuff they said\n\n`? Will it only contain 1 block for each person, or can there be a large number of blocks? — PM 2Ring, Oct 15 '15 at 09:54

score 2 · Accepted Answer · edited May 23 '17 at 12:14

This code assumes that the contents of convo strictly conform to the `name\nstuff they said\n\n` pattern. The only tricky code it uses is zip(*[iter(lines)]*3), which groups the lines list into triplets of strings. For a discussion of this technique and alternative techniques, please see How do you split a list into evenly sized chunks in Python?.

#!/usr/bin/env python

team_ids = ('A', 'B')

team_names = (
    ('Fred', 'Bob'),
    ('John', 'Jake'),
)

#Build a dict to get team name from person name
teams = {}
for team_id, names in zip(team_ids, team_names):
    for name in names:
        teams[name] = team_id


#Each block in convo MUST consist of <name>\n<one line of text>\n\n
#Do NOT omit the final blank line at the end
convo = '''Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine

'''

lines = convo.splitlines()

#Group lines into <name><text><empty> chunks
#and append the text into the appropriate list in `said`
said = {'A': [], 'B': []}
for name, text, _ in zip(*[iter(lines)]*3):
    team_id = teams[name]
    said[team_id].append(text)

for team_id in team_ids:
    print('Team %s said: %r' % (team_id, ' '.join(said[team_id])))

output

Team A said: 'hello how is it going?'
Team B said: 'hi we are doing fine'
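
The chunking idiom used above can be looked at in isolation. This is a minimal sketch, not part of the original answer; the helper name chunks_of_three is hypothetical. It relies on zip pulling three items in a row from the same iterator, which is what zip(*[iter(lines)]*3) does.

```python
def chunks_of_three(lines):
    """Group a flat list into consecutive 3-tuples (hypothetical helper)."""
    it = iter(lines)
    # All three arguments are the SAME iterator, so each tuple
    # consumes three consecutive items. Leftover items are dropped.
    return list(zip(it, it, it))

lines = ['Fred', 'hello', '', 'John', 'hi', '']
triplets = chunks_of_three(lines)

# A trailing block without its blank line is incomplete and gets dropped.
incomplete = chunks_of_three(['Fred', 'hello', '', 'John', 'hi'])
```

Because zip stops at the shortest input, an incomplete final chunk is silently discarded, which is why the final blank line in convo must not be omitted.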

Martin Evans · Answer 2 · 2015-10-15T10:28:56.507

You could use a regular expression to split up each entry. A list comprehension can then be used to extract the required entries for each team.

import re

def get_team_conversation(entries, team):
    # Keep the entries whose first line (the speaker's name) belongs to the team
    return [e for e in entries if e.split('\n')[0] in team]

Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']

convo = """
Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine"""

find_teams = '^(' + '|'.join(Team_A + Team_B) + r')$'
entries = [e[0].strip() for e in re.findall('(' + find_teams + '.*?)' + '(?=' + find_teams + r'|\Z)', convo, re.S+re.M)]

print('Team-A %s' % get_team_conversation(entries, Team_A))
print('Team-B %s' % get_team_conversation(entries, Team_B))

Giving the following output:

Team-A ['Fred\nhello', 'Bob\nhow is it going?']
Team-B ['John\nhi', 'Jake\nwe are doing fine']
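
The entries above still carry the speaker's name, while the question asks for one joined string per team. A possible follow-up step, sketched here under the assumption that each entry has the form name\ntext, is to drop the first line and join the rest. The function team_said is a hypothetical name, not part of the answer.

```python
def team_said(entries, team):
    """Join the spoken text of every entry whose speaker is in `team`."""
    spoken = []
    for entry in entries:
        # Split off the first line (the name); the remainder is the text
        name, _, text = entry.partition('\n')
        if name in team:
            # Collapse internal line breaks so the result is a single line
            spoken.append(' '.join(text.split()))
    return ' '.join(spoken)

# Entries in the shape produced by the regular expression above
entries = ['Fred\nhello', 'John\nhi', 'Bob\nhow is it going?', 'Jake\nwe are doing fine']
team_A_said = team_said(entries, ['Fred', 'Bob'])
team_B_said = team_said(entries, ['John', 'Jake'])
```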

Jiby · Answer 3 · 2015-10-15T10:07:33.047

This is a language-parsing problem.

This answer is a work in progress.

Finite state machine

A conversation transcript can be understood by imagining it being parsed by an automaton with the following states:

[start]  ---> [Name]----> [Text]-+----->[end]
               ^                 |
               |                 | (whitespaces)
               +-----------------+

You can parse your conversation by making it follow that state machine. If your parsing succeeds (i.e. it follows the states to the end of the text), you can browse your "conversation tree" to derive meaning.

Tokenizing your conversation (lexer)

You need functions to recognize the name state. This is straightforward:

name = (Team_A | Team_B) + '\n'
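
The state machine and the name check can be sketched as a small line-by-line parser. This is only one possible reading of the diagram, not the finished answer: the function parse_conversation and its state labels are hypothetical, and it assumes that a blank line separates blocks and that a block's text may span several lines.

```python
def parse_conversation(convo, known_names):
    """Walk the transcript once, line by line, and return (name, text) pairs."""
    state = 'start'  # one of: 'start' (expecting a name), 'name', 'text'
    name, text_lines, blocks = None, [], []
    for raw in convo.splitlines():
        line = raw.strip()
        if state == 'start':
            if not line:
                continue  # whitespace between blocks
            if line not in known_names:
                raise ValueError('expected a speaker name, got %r' % line)
            name, text_lines, state = line, [], 'name'
        elif line:
            text_lines.append(line)  # [Name] -> [Text], or more text
            state = 'text'
        elif state == 'text':
            blocks.append((name, ' '.join(text_lines)))
            state = 'start'  # blank line: go back to expecting a name
        else:
            raise ValueError('no text after speaker %r' % name)
    if state == 'name':
        raise ValueError('no text after speaker %r' % name)
    if state == 'text':
        blocks.append((name, ' '.join(text_lines)))  # [Text] -> [end]
    return blocks

Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']
convo = "Fred\nhello\n\nJohn\nhi\n\nBob\nhow is it going?\n\nJake\nwe are doing fine\n"
blocks = parse_conversation(convo, set(Team_A + Team_B))
team_A_said = ' '.join(text for name, text in blocks if name in Team_A)
team_B_said = ' '.join(text for name, text in blocks if name in Team_B)
```

The name check is a set membership test on each candidate line, so the text is read once and find is never called.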

Conversation alternation

In this answer, I did not assume that a conversation involves alternating between the people speaking, like this conversation would:

Fred     # author 1
hello

John     # author 2
hi

Bob      # author 3
how is it going ?

Bob      # ERROR : author 3 again !
are we still on for saturday, Fred ?

This might be problematic if your transcript concatenates answers from the same author.
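
If consecutive blocks from the same speaker matter for your transcript, one way to detect them is a check over already-parsed blocks. This is a sketch only: the (name, text) pair representation and the function repeated_speakers are assumptions made for illustration.

```python
def repeated_speakers(blocks):
    """Return the indices of blocks whose speaker also spoke the previous block."""
    return [i for i in range(1, len(blocks))
            if blocks[i][0] == blocks[i - 1][0]]

# Hypothetical parsed form of the transcript above: (name, text) pairs
blocks = [
    ('Fred', 'hello'),
    ('John', 'hi'),
    ('Bob', 'how is it going ?'),
    ('Bob', 'are we still on for saturday, Fred ?'),
]
repeats = repeated_speakers(blocks)
```

Whether a repeat is treated as an error or merged into the previous block is a design choice that depends on how the transcript was produced.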

but don't you first have to find and slice everything? then reassemble it? — jason, Oct 15 '15 at 10:06
Not directly : You only go through the text once while following the automata (and check for inclusion of text in given names, not calling `find`), whereas the find / slice method is more on the quadratic lookup side of things (looking all around the text backwards and forward). Does that answer your question ? — Jiby, Oct 15 '15 at 10:09
