2

I've got this data frame with data from IMDb in it. One of the columns has the movie title with the year attached in parentheses. Looks like this:

The Shawshank Redemption (1994)

What I really want is to have the title and year separate. I've tried a couple of different things (split, strsplit), but I've had no success. I tried to split on the first parenthesis, but the two split functions don't seem to like non-character arguments. Anyone have any thoughts?

Jaap
milk
  • 2
    Try `strsplit(as.character(v1), '\\s*\\(|\\)')[[1]]` where `v1 <- 'The Shawshank Redemption (1994)'`. I used `as.character` because I suspect your column might be of class `factor`. – akrun Sep 23 '15 at 14:45
  • 2
    Welcome to SO. Here's how to create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Include what you've already tried and why it didn't work. – Heroka Sep 23 '15 at 14:45
  • 1
    @akrun, could you please post your answer and explain it? – Soheil Sep 23 '15 at 14:46
  • 1
    Don't know if @akrun's solution works if there are parentheses in the movie title – nicola Sep 23 '15 at 14:50
  • 2
    @nicola In that case, we may use `strsplit(as.character(d1$v1), '\\s*\\((?=[0-9])|(?<=[0-9])\\)', perl=TRUE)[[1]]` – akrun Sep 23 '15 at 14:51
  • @akrun can you explain the ?<=[0-9] portion of your solution? (as compared to ?=[0-9]) – Andrew Taylor Sep 23 '15 at 14:59
  • @AndrewTaylor They are lookarounds: `(?<=[0-9])` is a lookbehind, so `\\)` matches only a `)` preceded by a digit, while `(?=[0-9])` is a lookahead, so `\\(` matches only a `(` followed by a digit. – akrun Sep 23 '15 at 15:01
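
To make the lookaround discussion in the comments above concrete, here is a minimal sketch comparing the plain pattern with the lookaround pattern. The title in `v2` is a made-up example (not from the question's data) chosen because it contains parentheses of its own; the variable names `plain` and `look` are likewise just for illustration.

```r
# Hypothetical title that contains parentheses of its own
v2 <- "Kung (Fu) Panda (2008)"

# Plain pattern: splits at every opening or closing parenthesis
plain <- strsplit(v2, "\\s*\\(|\\)")[[1]]

# Lookaround pattern: "(" only when followed by a digit (lookahead),
# ")" only when preceded by a digit (lookbehind); needs perl = TRUE
look <- strsplit(v2, "\\s*\\((?=[0-9])|(?<=[0-9])\\)", perl = TRUE)[[1]]

print(plain)
print(look)
```

The plain pattern breaks the title itself into several pieces, while the lookaround version leaves the title intact and returns only the title and the year.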

4 Answers

7

strsplit works on character vectors. So, if the column is of class factor, we need to convert it to character first (as.character(..)). Here, I match zero or more spaces (\\s*) followed by an opening parenthesis (\\(), or (|) a closing parenthesis (\\)), and split on that.

strsplit(as.character(d1$v1), '\\s*\\(|\\)')[[1]]
#[1] "The Shawshank Redemption" "1994"         

Or we can place the parentheses inside [] so that we don't have to escape them with \\ (as commented by @Avinash Raj)

strsplit(as.character(d1$v1), '\\s*[()]')[[1]]

data

v1 <- 'The Shawshank Redemption (1994)'
d1 <- data.frame(v1)
akrun
  • This worked, but returned a data.frame with two rows containing the titles and years instead of two columns. Still worked, though! – milk Sep 23 '15 at 15:54
  • @milk Maybe you need `do.call(rbind, strsplit(as.character(d1$v1), '\\s*[()]'))`. I used `[[1]]` because there was only a single element. – akrun Sep 23 '15 at 16:00
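
As a rough sketch of how the `do.call(rbind, ...)` suggestion above could produce two columns rather than two rows: the second title and the column names `title` and `year` are illustrative assumptions, and the approach assumes every entry splits into exactly two pieces.

```r
# Example data; the second title is added purely for illustration
d1 <- data.frame(v1 = c("The Shawshank Redemption (1994)",
                        "The Godfather (1972)"),
                 stringsAsFactors = FALSE)

# Each list element is c(title, year); rbind stacks them into a 2-column matrix
parts <- do.call(rbind, strsplit(as.character(d1$v1), "\\s*[()]"))

# Attach the pieces as new columns (column names are hypothetical)
d1$title <- parts[, 1]
d1$year  <- as.integer(parts[, 2])

print(d1)
```

If some titles contain extra parentheses, the pieces will not all have length two, and an end-anchored pattern would be the safer choice before binding the rows.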
3

If you want an exact split (i.e., splitting only on the brackets at the end of the string), you may try this.

x <- c("The Shawshank Redemption (1994)", "Kung(fu) Pa (23) nda (2010)")
strsplit(as.character(x), "\\s*\\((?=\\d+\\)$)|\\)$", perl=TRUE)
# [[1]]
# [1] "The Shawshank Redemption" "1994"                    

# [[2]]
# [1] "Kung(fu) Pa (23) nda" "2010"
Avinash Raj
2

tidyr solution

library(tidyr)  # provides separate() and re-exports %>%
df %>% separate(col, c("name", "year"), "[()]")

Thanks to Avinash, I can take his regular expression and apply it in tidyr

m <- c("The Shawshank Redemption (1994)", "The Shawshank (Redemption) (1994)", "Kung(fu) Pa (23) nda (2010)")
m2 <- data.frame(m)
m2 %>% separate(m, c("name", "year"), "\\s*\\((?=\\d+\\)$)|\\)$")

                        name year
1   The Shawshank Redemption 1994
2 The Shawshank (Redemption) 1994
3       Kung(fu) Pa (23) nda 2010
Ananta
0

Try the following code:

t(sapply(strsplit(c("The Shawshank Redemption (1994)"), '\\s*\\(|\\)'),rbind))

The same call works on a whole column: pass the data frame column containing the titles in place of the literal string (wrapped in as.character if the column is a factor).
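
A minimal sketch of applying the call above to a data frame column. The data frame `movies`, its column `title_year`, and the second title are made up for illustration; the column is deliberately a factor to mimic the situation described in the question.

```r
# Hypothetical data frame; the column is stored as a factor on purpose
movies <- data.frame(
  title_year = factor(c("The Shawshank Redemption (1994)",
                        "The Godfather (1972)"))
)

# Convert to character first, split, then transpose so each row is one movie
res <- t(sapply(strsplit(as.character(movies$title_year), "\\s*\\(|\\)"), rbind))

print(res)
```

The result is a character matrix with one row per movie, titles in the first column and years in the second. This relies on every entry splitting into exactly two pieces.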

FelixNNelson