Match names, dialogues, and actions from transcript using regex

Question

Given a dialogue string such as the one below, I need to find the sentence spoken by each user.

text = '''CHRIS: Hello, how are you...
PETER: Great, you? PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?'''

For the above dialogue, I would like to return tuples of three elements:

The name of the person
The sentence in lower case and
The sentences within brackets

Something like this:

('CHRIS', 'Hello, how are you...', None)

('PETER', 'Great, you?', None)

('PAM', 'He is resting', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD')

('CHRIS', 'Are you ok?', None)

etc...

I am trying to use regex to achieve the above. So far I was able to get the names of the users with the below code. I am struggling to identify the sentence between two users.

import re

actors = re.findall(r'\w+(?=\s*:[^/])',text)

This doesn't necessarily seem like a regex problem. Have you tried `str.split`? Also in your example output, what happened to `"Are you ok?"`? — pault, Dec 24 '18 at 15:24
@pault I tried to split it at ':' but again I have to identify the first word before the ':' and the whole sentence after the ':'. The sentence should stop before the name of the next user. That's why I thought regex would be helpful. Added the last sentence as well. Thanks — pbou, Dec 24 '18 at 15:26
I'd look into `nltk` and stop writing custom regex. Thank me in 2 years :) — Ufos, Jun 05 '19 at 16:06

cs95 · Accepted Answer · 2018-12-24T15:36:02.670

You can do this with re.findall:

>>> re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
[('CHRIS', ' Hello, how are you...', ''),
 ('PETER', ' Great, you? ', ''),
 ('PAM',
  ' He is resting.',
  '[PAM SHOWS THE COUCH]\n[PETER IS NODDING HIS HEAD]\n'),
 ('CHRIS', ' Are you ok?', '')]

You will have to remove the square brackets yourself in a separate step; a single regex cannot strip them while still matching everything else.
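One possible way to do that post-processing is sketched below. It is not part of the original answer: the helper name `clean` and the choice to join multiple actions with `'. '` are assumptions made to mirror the output requested in the question.

```python
import re

text = (
    "CHRIS: Hello, how are you...\n"
    "PETER: Great, you? PAM: He is resting.\n"
    "[PAM SHOWS THE COUCH]\n"
    "[PETER IS NODDING HIS HEAD]\n"
    "CHRIS: Are you ok?"
)

# Same pattern as in the answer above.
PATTERN = r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)'

def clean(matches):
    """Hypothetical helper: strip whitespace and unwrap bracketed actions."""
    rows = []
    for name, sentence, actions in matches:
        # Pull the text out of every [...] block in the third group.
        found = re.findall(r'\[([^\]]*)\]', actions)
        rows.append((name, sentence.strip(), '. '.join(found) if found else None))
    return rows

result = clean(re.findall(PATTERN, text))
for row in result:
    print(row)
```

Tuples without any bracketed action get `None` as the third element, as in the question's example.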

Regex Breakdown

\b              # Word boundary
(\S+)           # First capture group, one or more non-whitespace characters
:               # Colon
(               # Second capture group
    [^          # Match anything that is not...
        :       #     a colon
        \[\]    #     or square brackets
    ]+?         # Non-greedy match
)
\n?             # Optional newline
(               # Third capture group
    \[          # Literal opening bracket
    [^:]+?      # Similar to above - exclude colon from match
    \] 
    \n?         # Optional newlines
)?              # Third capture group is optional
(?=             # Lookahead for... 
    \b          #     a word boundary, followed by  
    \S+         #     one or more non-space chars, and
    :           #     a colon
    |           # Or,
    $           # End of string (re.MULTILINE is not set)
)
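The breakdown above can also be kept next to the pattern itself by compiling it with `re.VERBOSE`. This is only a sketch; the name `DIALOGUE` is made up for illustration. Note that character classes are written on a single line here, because whitespace inside `[...]` is still significant in verbose mode.

```python
import re

text = (
    "CHRIS: Hello, how are you...\n"
    "PETER: Great, you? PAM: He is resting.\n"
    "[PAM SHOWS THE COUCH]\n"
    "[PETER IS NODDING HIS HEAD]\n"
    "CHRIS: Are you ok?"
)

# Hypothetical name; same pattern as the one-line version, with inline comments.
DIALOGUE = re.compile(r"""
    \b(\S+)              # group 1: speaker name (run of non-whitespace)
    :                    # colon after the name
    ([^:\[\]]+?)         # group 2: sentence, lazy, no colon or square brackets
    \n?                  # optional newline
    (\[[^:]+?\]\n?)?     # group 3 (optional): bracketed action lines
    (?=\b\S+:|$)         # stop before the next speaker label or at end of string
""", re.VERBOSE)

matches = DIALOGUE.findall(text)
print(matches)
```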

Oh wow, the output looks exactly like what I was looking for. For the square brackets I have this, just in case anyone needs it: `re.sub(r'(\[|\])','',text)`. Thank you — pbou, Dec 24 '18 at 15:31

pault · Answer 2 · 2018-12-26T16:08:17.190

Regex is one way to approach this problem, but you can also think about it as iterating through each token in your text and applying some logic to form groups.

For example, we could first find groups of names and text:

from itertools import groupby

def isName(word):
    # Names end with ':'
    return word.endswith(":")

text_split = [
    " ".join(list(g)).rstrip(":") 
    for i, g in groupby(text.replace("]", "] ").split(), isName)
]
print(text_split)
#['CHRIS',
# 'Hello, how are you...',
# 'PETER',
# 'Great, you?',
# 'PAM',
# 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
# 'CHRIS',
# 'Are you ok?']

Next you can collect pairs of consecutive elements in text_split into tuples:

print([(text_split[i*2], text_split[i*2+1]) for i in range(len(text_split)//2)])
#[('CHRIS', 'Hello, how are you...'),
# ('PETER', 'Great, you?'),
# ('PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]'),
# ('CHRIS', 'Are you ok?')]

We're almost at the desired output. We just need to deal with the text in the square brackets. You can write a simple function for that. (Regular expressions are admittedly an option here, but I'm purposely avoiding them in this answer.)

Here's something quick that I came up with:

def isClosingBracket(word):
    return word.endswith("]")

def processWords(words):
    if "[" not in words:
        return [words, None]
    else:
        return [
            " ".join(g).replace("]", ".") 
            for i, g in groupby(map(str.strip, words.split("[")), isClosingBracket)
        ]

print(
    [(text_split[i*2], *processWords(text_split[i*2+1])) for i in range(len(text_split)//2)]
)
#[('CHRIS', 'Hello, how are you...', None),
# ('PETER', 'Great, you?', None),
# ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD.'),
# ('CHRIS', 'Are you ok?', None)]

Note that using * to unpack the result of processWords inside a tuple literal is a Python 3 feature (it requires Python 3.5 or later).

Thank you for this. It appears I have cases where there is no space between ']' and the name of the user. Therefore, word.endswith(':') doesn't split the name correctly. Do you know how I could cope with it? — pbou, Dec 25 '18 at 10:05
@pbou one way is to replace all `"]"` with `"] "` in the beginning. See the edit. — pault, Dec 26 '18 at 16:08
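To see why that replacement helps, here is a small illustration on a made-up fragment where a closing bracket runs straight into the next name (the variable names are only for this example):

```python
# Hypothetical input: no whitespace between "]" and the next speaker.
raw = "[PETER IS NODDING HIS HEAD]CHRIS: Are you ok?"

# Plain split() keeps "HEAD]CHRIS:" as one token, so the name is not isolated.
without_fix = raw.split()

# Padding every "]" with a space lets split() separate the name token.
with_fix = raw.replace("]", "] ").split()

print(without_fix)
print(with_fix)
```

With the padding in place, `CHRIS:` becomes its own token, so a check like `word.endswith(":")` sees just the name rather than `HEAD]CHRIS:`.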
