Separating text in R

Question

I have a data.frame that contains a column named movies_name. This column contains data in this format: City of Lost Children, The (Cité des enfants perdus, La) (1995). I want to separate the year from the rest of the movie name without losing the text inside the brackets. To be more precise, I want to create a new column holding the year and another one for the movie name alone.

I tried this approach but now I cannot gather back the movie name:

My approach

thanks

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Please [do not post code or data in images](https://meta.stackoverflow.com/q/285551/2372064) — MrFlick, Jan 13 '23 at 16:19

Chris Ruehlemann · Accepted Answer · 2023-01-13T17:51:05.657

Try the function extract from tidyr (part of the tidyverse):

library(tidyverse)
df %>%
  extract(movies_name,
          into = c("title", "year"),
          regex = "(\\D+)\\s\\((\\d+)\\)")
                                                     title year
1 City of Lost Children, The (Cité des enfants perdus, La) 1995
2                                             another film 2020

How the regex works:

(\\D+): first capture group, matching one or more characters that are not digits
\\s\\(: a whitespace and an opening parenthesis (not captured)
(\\d+): second capture group, matching one or more digits
\\): a closing parenthesis (not captured)
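
To see what each capture group picks up without loading tidyr, the same pattern can be run through base R. This is only a sketch: the names x, parts and result are made up for illustration, and perl = TRUE and the integer conversion are my choices here, not something the answer requires.

```r
# Sample titles copied from the answer's "Data 1"
x <- c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
       "another film (2020)")

# Same pattern as in the extract() call
pattern <- "(\\D+)\\s\\((\\d+)\\)"

# regexec() returns the full match plus one entry per capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Each element of m holds 3 strings: full match, group 1, group 2
parts <- do.call(rbind, m)
result <- data.frame(title = parts[, 2],
                     year  = as.integer(parts[, 3]),  # assumption: year wanted as a number
                     stringsAsFactors = FALSE)
print(result)
```

In the tidyr version, extract() leaves both new columns as character unless convert = TRUE is passed.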

Data 1:

df <- data.frame(
  movies_name = c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
                  "another film (2020)")
)

EDIT:

Okay, following the comment, let's make this a little more complex by including a title that itself contains digits:

Data 2:

df <- data.frame(
  movies_name = c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
                  "another film (2020)",
                  "Under Siege 2: Dark Territory (1995)")
)

Solution - actually easier than the previous one ;)

df %>%
  extract(movies_name,
          into = c("title", "year"), 
          regex = "(.+)\\s\\((\\d+)\\)")
                                                     title year
1 City of Lost Children, The (Cité des enfants perdus, La) 1995
2                                             another film 2020
3                            Under Siege 2: Dark Territory 1995
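
Why the switch from \\D+ to .+ matters can be checked directly. The sketch below uses base R rather than tidyr and a hypothetical helper, first_group(), which is not part of the answer; it simply returns what the first capture group matched.

```r
# Hypothetical helper: text captured by the first group of `pattern` in `x`
first_group <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
         character(1))
}

title_with_digit <- "Under Siege 2: Dark Territory (1995)"

# \\D+ cannot cross the "2", so the match can only start after it
old_title <- first_group("(\\D+)\\s\\((\\d+)\\)", title_with_digit)

# .+ is greedy: it takes as much as possible, then backs off just far
# enough for the trailing " (digits)" part to match
new_title <- first_group("(.+)\\s\\((\\d+)\\)", title_with_digit)

print(old_title)  # ": Dark Territory"
print(new_title)  # "Under Siege 2: Dark Territory"
```

Because .+ is greedy, a parenthesised number in the middle of a title stays in the title and only the last one is taken as the year.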

+1 for extract, though digits in titles are pretty common (MovieLens dataset for reference - https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset?select=movie.csv ) — margusl, Jan 13 '23 at 17:35
Unfortunately, I am not eligible to upvote (need more than 15 reputation to upvote). — Nada Abbas, Jan 14 '23 at 15:12

score 0 · Answer 2 · answered Jan 13 '23 at 16:31

This looks for a number in round brackets at the end of the string, using stringr.

data.frame(movies, year = stringr::str_match(movies$movie, "\\((\\d+)\\)$")[,2])
                                                                   movie year
1 City of Lost Children, The (Cité des enfants (2002) perdus, La) (1995) 1995
2        City of Lost Children, The (Cité des enfants perdus, La) (1995) 1995

Data

movies <- structure(list(movie = c("City of Lost Children, The (Cité des enfants (2002) perdus, La) (1995)",
"City of Lost Children, The (Cité des enfants perdus, La) (1995)"
)), row.names = c(NA, -2L), class = "data.frame")
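
This answer returns only the year. If the title is wanted as well, one possible follow-up is to strip the trailing year with sub(). This is sketched in base R as an assumption, not as part of the answer, and the column name title is made up for illustration.

```r
# Same sample data as above, rebuilt so the block runs on its own
movies <- data.frame(
  movie = c("City of Lost Children, The (Cité des enfants (2002) perdus, La) (1995)",
            "City of Lost Children, The (Cité des enfants perdus, La) (1995)"),
  stringsAsFactors = FALSE
)

# Year: digits inside parentheses at the very end of the string
movies$year <- sub(".*\\((\\d+)\\)$", "\\1", movies$movie)

# Title: drop the trailing " (year)" and keep everything before it
movies$title <- sub("\\s*\\(\\d+\\)$", "", movies$movie)

print(movies[, c("title", "year")])
```

Note that sub() returns the input unchanged when the pattern does not match, so rows without a trailing year would need separate handling.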
