I have a data.frame that contains a column named movies_name. This column contains data in this format: City of Lost Children, The (Cité des enfants perdus, La) (1995). I want to separate the year from the rest of the movie name without losing the text inside the parentheses. More precisely, I want to create one new column holding the year and another holding the movie name alone.

I tried this approach but now I cannot gather back the movie name:

My approach

thanks

  • 1
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Please [do not post code or data in images](https://meta.stackoverflow.com/q/285551/2372064) – MrFlick Jan 13 '23 at 16:19

2 Answers

Try the function extract from tidyr (part of the tidyverse):

library(tidyverse)
df %>%
  extract(movies_name,
          into = c("title", "year"),
          regex = "(\\D+)\\s\\((\\d+)\\)")
                                                     title year
1 City of Lost Children, The (Cité des enfants perdus, La) 1995
2                                             another film 2020

How the regex works:

  • (\\D+): first capture group, matching one or more characters that are not digits
  • \\s\\(: a whitespace and an opening parenthesis (not captured)
  • (\\d+): second capture group, matching one or more digits
  • \\): a closing parenthesis (not captured)
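
To see what each capture group picks up, the same pattern can be tried in base R with regexec and regmatches. This is only a sketch for inspecting the groups, not what extract does internally, and the sample strings in x are made up for illustration:

```r
# Illustrative sample strings (hypothetical, ASCII only)
x <- c("another film (2020)", "Title, The (Alt Title, La) (1995)")

# Same pattern as in the extract() call above
pattern <- "(\\D+)\\s\\((\\d+)\\)"

# regexec() returns the full match plus one entry per capture group
m <- regmatches(x, regexec(pattern, x))

# Stack the per-string results into a matrix: full match, group 1, group 2
parts <- do.call(rbind, m)
colnames(parts) <- c("match", "title", "year")
parts
```

The title column holds the first capture group and the year column the second; the whitespace and parentheses around the year appear only in the match column.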

Data 1:

df <- data.frame(
  movies_name = c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
                  "another film (2020)")
)

EDIT:

Okay, following a comment, let's make this a little more complex by including a title that itself contains digits:

Data 2:

df <- data.frame(
  movies_name = c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
                  "another film (2020)",
                  "Under Siege 2: Dark Territory (1995)")
)

Solution - actually easier than the previous one ;)

df %>%
  extract(movies_name,
          into = c("title", "year"), 
          regex = "(.+)\\s\\((\\d+)\\)")
                                                     title year
1 City of Lost Children, The (Cité des enfants perdus, La) 1995
2                                             another film 2020
3                            Under Siege 2: Dark Territory 1995
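
Why the change from \\D+ to .+ matters can be checked on the title with a digit in it. The snippet below is a base R sketch (regexec and regmatches rather than extract) that runs both patterns on that one string; the variable names are just for illustration:

```r
title_with_digit <- "Under Siege 2: Dark Territory (1995)"

# First pattern: \\D+ cannot cross the "2", so the match can only start after it
first_try <- regmatches(
  title_with_digit,
  regexec("(\\D+)\\s\\((\\d+)\\)", title_with_digit)
)[[1]]

# Second pattern: .+ accepts any character, digits included
second_try <- regmatches(
  title_with_digit,
  regexec("(.+)\\s\\((\\d+)\\)", title_with_digit)
)[[1]]

first_try[2]   # ": Dark Territory" (title truncated)
second_try[2]  # "Under Siege 2: Dark Territory"
```

Because .+ is greedy, it takes as much as it can and then backs off only far enough for the trailing " (year)" part to match, so earlier parentheses in the title stay in the first group.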
Chris Ruehlemann

This looks for a number in round brackets at the end of the string, using stringr.

data.frame(movies, year = stringr::str_match(movies$movie, "\\((\\d+)\\)$")[,2])
                                                                   movie year
1 City of Lost Children, The (Cité des enfants (2002) perdus, La) (1995) 1995
2        City of Lost Children, The (Cité des enfants perdus, La) (1995) 1995

Data

movies <- structure(list(movie = c("City of Lost Children, The (Cité des enfants (2002) perdus, La) (1995)",
"City of Lost Children, The (Cité des enfants perdus, La) (1995)"
)), row.names = c(NA, -2L), class = "data.frame")
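
If the title is also needed as its own column, one possible base R variant of the same idea (anchoring on the parenthesised number at the end of the string) is sketched below. It uses sub instead of stringr, and movies_demo with its contents is a made-up example rather than the data above:

```r
# Hypothetical sample data, including a decoy "(2002)" inside the title
movies_demo <- data.frame(
  movie = c("Some Film (Alt (2002) Title) (1995)", "another film (2020)")
)

# Year: keep only the digits of the final "(dddd)" at the end of the string
movies_demo$year <- as.integer(sub(".*\\((\\d+)\\)$", "\\1", movies_demo$movie))

# Title: drop the final "(dddd)" together with the whitespace before it
movies_demo$title <- sub("\\s*\\(\\d+\\)$", "", movies_demo$movie)

movies_demo
```

One caveat: sub returns its input unchanged when the pattern does not match, so a row without a trailing year would give NA (with a warning) in year and the untouched string in title.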
Andre Wildberg