
I have strings containing the names of the actors in a movie, one per line, and I want to extract those names. In some cases the actor's character name is also included (after " as "), and it must be ignored. Here are a couple of examples:

# example 1
input = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
expected_output = ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

# example 2
input = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
expected_output = ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']

Here is what I've tried to no avail:

import re
output = re.findall(r'(?:\\n)?([\w ]+)(?= as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?: as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?:(?= as )|(?! as ))', input)
vj1

3 Answers

0

You can also do this without regular expressions. Split the input into lines, then keep only the part of each line before " as ":

output = [x.split(' as ')[0] for x in input.split('\n')]
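A slightly more defensive sketch of the same split-based idea, assuming the separator is always the lower-case word "as" surrounded by single spaces. The names `extract_actor_names` and `cast_text` are made up for illustration and do not come from the question.

```python
def extract_actor_names(cast_text):
    """Return the actor name from each non-empty line, dropping ' as <character>'."""
    names = []
    for line in cast_text.split('\n'):
        # partition() splits on the first ' as ' only; when the separator is
        # absent, the whole line ends up in `name`.
        name, _sep, _character = line.partition(' as ')
        name = name.strip()
        if name:  # skip blank lines
            names.append(name)
    return names


example_1 = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
example_2 = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'

print(extract_actor_names(example_1))
# ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
print(extract_actor_names(example_2))
# ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
```

Including the spaces in the separator matters: splitting on ' as ' rather than ' as' avoids cutting a name whose second word happens to start with "as".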
0

The \n sequences in the input string are newline characters. We can make use of this fact in our regex.

Essentially, each line always begins with the actor's name. After the actor's name, there is either the word "as" or the end of the line.

Using this info, we can write the regex like this:

^(?:[\w ]+?)(?:(?= as )|$)

First, ^ asserts that we are at the start of a line. Then [\w ]+? lazily matches word characters and spaces, stopping as soon as (?:(?= as )|$) succeeds, that is, when either " as " comes next or we have reached the end of the line.

In code,

output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', input, re.MULTILINE)

Remember to use the multiline option. That is what makes ^ and $ mean "start/end of line".
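To see why both the lazy quantifier and the multiline flag matter, here is a small comparison sketch. The constant names `LAZY` and `GREEDY` and the variable names are illustrative only; the `LAZY` pattern is the one from this answer, and `GREEDY` is the same pattern with `+?` changed to `+`.

```python
import re

LAZY = r'^(?:[\w ]+?)(?:(?= as )|$)'
GREEDY = r'^(?:[\w ]+)(?:(?= as )|$)'

example_1 = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
example_2 = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'

# Lazy + MULTILINE: stops at the first ' as ' or at the end of each line.
lazy_multiline = re.findall(LAZY, example_1, re.MULTILINE)
print(lazy_multiline)
# ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

# Without MULTILINE, ^ only matches at the very start of the string,
# so only the first name is found.
lazy_no_flag = re.findall(LAZY, example_1)
print(lazy_no_flag)
# ['Levan Gelbakhiani']

# Greedy + MULTILINE: [\w ]+ swallows the whole line, including ' as Merab',
# and then $ succeeds, so the character names are not removed.
greedy_multiline = re.findall(GREEDY, example_1, re.MULTILINE)
print(greedy_multiline)
# ['Levan Gelbakhiani as Merab', 'Ana Javakishvili as Mary', 'Anano Makharadze']

print(re.findall(LAZY, example_2, re.MULTILINE))
# ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
```

Note that [\w ] only covers word characters and spaces, so this assumes names contain no hyphens, apostrophes or periods.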

Sweeper
  • I didn't know about lazy evaluation but that was what I was trying to achieve in my mind. Thank you. – vj1 Jan 29 '20 at 15:35
  • "lazy" in a regex context doesn't quite mean the same as "lazy" in "lazy evaluation". Have a look at the link in the answer. @vj1 – Sweeper Jan 29 '20 at 15:46
0

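I guess you can use a single regex with two alternatives, one for lines that contain a character name and one for lines that do not, and combine the values captured by each:

output = re.findall(r'^(.+?)\Was\W.*$|^(.+)$', input, re.MULTILINE)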

gives

[('Levan Gelbakhiani', ''), ('Ana Javakishvili', ''), ('', 'Anano Makharadze')]

from which you filter the empty strings out

output = list(map(lambda x : list(filter(len, x))[0], output))

gives

['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
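If the filtering step feels heavy, a variant with a single capture group returns the names directly. This is a sketch of an alternative pattern, not the one used above, and it assumes the character name is always introduced by " as " with single spaces; `NAME_PATTERN` and `text` are illustrative names.

```python
import re

# One capture group for the name; the ' as <character>' tail is optional.
# The lazy (.+?) grows only until the optional tail plus end-of-line can match.
NAME_PATTERN = re.compile(r'^(.+?)(?: as .*)?$', re.MULTILINE)

text = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'

# With exactly one group, findall returns a flat list of that group's matches.
names = NAME_PATTERN.findall(text)
print(names)
# ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
```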
Jarvis