I have a csv file that looks like the following:

Halley Bailey - 1998 
Hayley Orrantia (1994-) American actress, singer, and songwriter 
Ken Watanabe (actor) 
etc...

I'd like to remove the parenthesized items, as well as anything following a comma in the entries that contain one, so that the dataframe looks like this:

Halley Bailey
Hayley Orrantia
Ken Watanabe

I tried the following code, which removes the dates after the name but not the parentheses or the text after commas. How could I extend it to handle all of these cases?

regex = '|'.join(map(re.escape, df['actors']))
nic.o

2 Answers

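Try the pattern '(^[^\(|^\-]+)', which captures everything from the start of the string up to the first - or ( (inside the brackets, the | and the second ^ are literal characters, so the match also stops at those):

df['Full Name'] = df['Description'].str.extract(r'(^[^\(|^\-]+)')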

Returning:

                                         Description        Full Name
0                               Halley Bailey - 1998    Halley Bailey 
1  Hayley Orrantia (1994-) American actress, sing...  Hayley Orrantia 
2                               Ken Watanabe (actor)     Ken Watanabe 
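
A possible variant, not part of the original answer: a minimal sketch that also cuts at commas, strips the trailing whitespace visible in the output above, and treats a hyphen as a separator only when it is preceded by whitespace, so hyphenated names survive. The helper name clean_name and the last two sample strings are hypothetical.

```python
import re

# Cut at the first " -" separator (hyphen preceded by whitespace),
# opening parenthesis, or comma, whichever comes first.
SEPARATOR = re.compile(r"\s+-(?:\s|$)|\(|,")

def clean_name(text: str) -> str:
    """Return the part of text before the first separator, without padding."""
    return SEPARATOR.split(text, maxsplit=1)[0].strip()

samples = [
    "Halley Bailey - 1998",
    "Hayley Orrantia (1994-) American actress, singer, and songwriter",
    "Ken Watanabe (actor)",
    "Ernest Hogen -",             # trailing hyphen
    "Mary-Kate Olsen (actress)",  # hyphen inside the name is kept
]
for s in samples:
    print(clean_name(s))
```

With pandas, this could be applied as df['id'].map(clean_name), assuming the column is named id as in the comment below.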
Celius Stingher
  • I'm getting the error KeyError: 'id' from the following code. The only column in the dataframe, the one with the actors' names, is id: `df = pd.read_csv('df.csv', on_bad_lines='skip', encoding="latin-1")` `df['id'] = df['id'].astype('|S')` `df.head()` – nic.o Nov 14 '22 at 17:36

Assuming that the CSV content is stored in the column csv of the dataframe df, and that df looks like the following (if one doesn't know how to read a CSV into a pandas DataFrame, see the Notes below):

                                                 csv
0                               Halley Bailey - 1998
1  Hayley Orrantia (1994-) American actress, sing...
2                               Ken Watanabe (actor)

If one wants to create a new column named actors, and assuming that an actor's full name always consists of exactly two words, the following will do the job:

df['actors'] = df['csv'].str.split(' ').str[:2].str.join(' ')

[Out]:

                                                 csv           actors
0                               Halley Bailey - 1998    Halley Bailey
1  Hayley Orrantia (1994-) American actress, sing...  Hayley Orrantia
2                               Ken Watanabe (actor)     Ken Watanabe

If, on the other hand, one doesn't want to create a new column, one can overwrite the existing one:

df['csv'] = df['csv'].str.split(' ').str[:2].str.join(' ')

[Out]:

               csv
0    Halley Bailey
1  Hayley Orrantia
2     Ken Watanabe
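
To make the two-word assumption concrete, here is a small sketch of what the split/slice/join chain does to a single value, written in plain Python. The function name first_two_words and the last two sample strings are hypothetical and only illustrate where the assumption breaks down.

```python
def first_two_words(text: str) -> str:
    # Per-value equivalent of .str.split(' ').str[:2].str.join(' ')
    return " ".join(text.split(" ")[:2])

print(first_two_words("Ken Watanabe (actor)"))           # Ken Watanabe
print(first_two_words("Jean-Claude Van Damme (actor)"))  # Jean-Claude Van
print(first_two_words("Cher (singer)"))                  # Cher (singer)
```

Names with more or fewer than two words are truncated or keep part of the description, so this approach only fits data where every name has exactly two words.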

Notes:

Gonçalo Peres
  • Sorry but I don't understand why you use both 'actors' and 'csv', my dataframe only has one column ('id'). Should I just use 'id' both times? – nic.o Nov 14 '22 at 17:40
  • @NicoO `actors` is to create a new column named `actors`. As I do not know the column of your dataframe `df`, I assumed it was `csv`. If the column is named `id`, and in that column you have the content similar to the `csv` column in the Output of my answer, then it is a matter of changing in the script from `csv` to `id`. – Gonçalo Peres Nov 14 '22 at 17:44
  • This works, thanks Goncalo. I still get some names with a trailing - (e.g. Ernest Hogen -). Do you know what I can add to the regex to fix this? – nic.o Nov 14 '22 at 21:44
  • Hard to come up with a regex that solves all the possible use cases without having access to the whole data. My recommendation is for you to ask a different question indicating specifically the cases that the solution(s) is/are not solving. – Gonçalo Peres Nov 15 '22 at 13:48