Splitting a column in a data frame?

Question

I've got this data frame with data from IMDb in it. One of the columns has the movie title with the year attached in parentheses. Looks like this:

The Shawshank Redemption (1994)

What I really want is to have the title and year in separate columns. I've tried a couple of different things (split, strsplit), but I've had no success. I tried splitting on the opening parenthesis, but the two split functions don't seem to like non-character arguments. Anyone have any thoughts?

Try `strsplit(as.character(v1), '\\s*\\(|\\)')[[1]]` where `v1 <- 'The Shawshank Redemption (1994)'`. I used `as.character` because I suspect your column might be of class `factor`. – akrun, Sep 23 '15 at 14:45
Welcome to SO. Here's how to create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Include what you've already tried and why it didn't work. — Heroka, Sep 23 '15 at 14:45
Don't know if @akrun's solution works if there are parenthesis in the movie title — nicola, Sep 23 '15 at 14:50
@nicola In that case, we may use `strsplit(as.character(d1$v1), '\\s*\\((?=[0-9])|(?<=[0-9])\\)', perl=TRUE)[[1]]` – akrun, Sep 23 '15 at 14:51
@akrun can you explain the ?<=[0-9] portion of your solution? (as compared to ?=[0-9]) — Andrew Taylor, Sep 23 '15 at 14:59
@AndrewTaylor It is a lookaround: `(?<=[0-9])` matches a `)` only when it is preceded by a digit, and `(?=[0-9])` matches a `(` only when it is followed by a digit. – akrun, Sep 23 '15 at 15:01
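
To make the lookaround discussion above concrete, here is a minimal sketch comparing the plain split with the lookaround split. The second title is a made-up example (not from the question) chosen only because it has parentheses inside the name.

```r
# Hypothetical sample titles; the second has extra parentheses in the name
titles <- c("The Shawshank Redemption (1994)", "Kung (Fu) Panda (2008)")

# Plain split on any parenthesis: this also breaks the title apart at "(Fu)"
plain <- strsplit(titles, "\\s*\\(|\\)")

# Lookaround split: "(" counts only if a digit follows it,
# ")" counts only if a digit precedes it
guarded <- strsplit(titles, "\\s*\\((?=[0-9])|(?<=[0-9])\\)", perl = TRUE)

print(plain[[2]])    # four pieces, title is broken up
print(guarded[[2]])  # two pieces: title and year
```

This would still split in the wrong place if the title itself contains a parenthesis next to a digit, so treat it as a sketch rather than a general solution.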

akrun · Answer 1 · 2015-09-23T14:53:34.887

7

strsplit works on character columns. So, if the column is of class factor, we need to convert it to character first (as.character(..)). Here, I match zero or more spaces (\\s*) followed by an opening parenthesis (\\(), or (|) a closing parenthesis (\\)), and split on that.

strsplit(as.character(d1$v1), '\\s*\\(|\\)')[[1]]
#[1] "The Shawshank Redemption" "1994"

Or we can place the parentheses inside [] so that we don't have to escape them with \\ (as commented by @Avinash Raj)

strsplit(as.character(d1$v1), '\\s*[()]')[[1]]

data

v1 <- 'The Shawshank Redemption (1994)'
d1 <- data.frame(v1)

edited Sep 23 '15 at 14:53

answered Sep 23 '15 at 14:49

akrun


This worked, but returned a data.frame with two rows containing the titles and years instead of two columns. Still worked, though! – milk Sep 23 '15 at 15:54
@milk Maybe you need `do.call(rbind, strsplit(as.character(d1$v1), '\\s*[()]'))`. I used `[[1]]` because there was only a single element. – akrun Sep 23 '15 at 16:00
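
A minimal sketch of how the `do.call(rbind, ...)` suggestion above could be used to get two columns. The second movie row and the column names `title` and `year` are assumptions for illustration, and the sketch assumes every row splits into exactly two pieces.

```r
# Hypothetical sample data; stringsAsFactors = TRUE mimics a factor column
d1 <- data.frame(v1 = c("The Shawshank Redemption (1994)",
                        "The Godfather (1972)"),
                 stringsAsFactors = TRUE)

# One character vector per row, e.g. c("The Godfather", "1972")
parts <- strsplit(as.character(d1$v1), "\\s*[()]")

# Stack the vectors as rows of a character matrix (one row per movie)
m <- do.call(rbind, parts)

# Attach the matrix columns to the data frame under illustrative names
d1$title <- m[, 1]
d1$year <- as.integer(m[, 2])
print(d1)
```

If some rows split into more or fewer than two pieces, `rbind` recycles values with a warning, so it is worth checking `lengths(parts)` first.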

score 3 · Answer 2 · answered Sep 23 '15 at 14:55

If you want to do an exact split (i.e., split only on the brackets at the very end of the string), you may try this.

x <- c("The Shawshank Redemption (1994)", "Kung(fu) Pa (23) nda (2010)")
strsplit(as.character(x), "\\s*\\((?=\\d+\\)$)|\\)$", perl = TRUE)
# [[1]]
# [1] "The Shawshank Redemption" "1994"                    

# [[2]]
# [1] "Kung(fu) Pa (23) nda" "2010"
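
As an alternative to splitting, the same "year in brackets at the very end" idea can be sketched with `sub()` and capture groups. This is a swapped-in technique, not the answer's method, and the names `pat`, `title` and `year` are only illustrative.

```r
x <- c("The Shawshank Redemption (1994)", "Kung(fu) Pa (23) nda (2010)")

# Group 1: everything before the trailing bracket (lazy match)
# Group 2: the digits inside the final "(...)" at the end of the string
pat <- "^(.*?)\\s*\\((\\d+)\\)$"

title <- sub(pat, "\\1", x, perl = TRUE)
year <- as.integer(sub(pat, "\\2", x, perl = TRUE))

print(data.frame(title, year))
```

Strings that do not end in a bracketed number are returned unchanged by `sub()`, so `year` would come out as `NA` (with a warning) for those rows.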

Ananta · Answer 3 · 2015-09-23T15:04:32.540

2

tidyr solution

library(tidyr)
# df and col stand for your own data frame and column name
df %>% separate(col, c("name", "year"), "[()]")

Thanks to Avinash, I can take his regular expression and apply it in tidyr:

m<-c("The Shawshank Redemption (1994)","The Shawshank (Redemption) (1994)", "Kung(fu) Pa (23) nda (2010)")
m2<-data.frame(m)
m2%>%separate(m,c("name", "year"), "\\s*\\((?=\\d+\\)$)|\\)$")

                        name year
1   The Shawshank Redemption 1994
2 The Shawshank (Redemption) 1994
3       Kung(fu) Pa (23) nda 2010

edited Sep 23 '15 at 15:04

answered Sep 23 '15 at 14:56

Ananta


I got an "Invalid column specification" when I tried this. Maybe I'm doing something wrong? – milk Sep 23 '15 at 15:52
With which code? Do you have a data frame called `df` with a column called `col`? – Ananta Sep 23 '15 at 17:02

score 0 · Accepted Answer · answered Sep 23 '15 at 14:56

0

Try the following code:

t(sapply(strsplit(c("The Shawshank Redemption (1994)"), '\\s*\\(|\\)'),rbind))

The above code will work if you just pass in the column of your data frame containing the title.
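
A minimal sketch of what "passing in the column" might look like. The data frame `movies` and its column `title_year` are hypothetical names, and `as.character()` is added here in case the column is a factor.

```r
# Hypothetical data frame; the column name title_year is an assumption
movies <- data.frame(title_year = c("The Shawshank Redemption (1994)",
                                    "Pulp Fiction (1994)"),
                     stringsAsFactors = TRUE)

# Split each entry into its title and year pieces
pieces <- strsplit(as.character(movies$title_year), "\\s*\\(|\\)")

# sapply returns one column per movie, so transpose to get one row per movie
result <- t(sapply(pieces, rbind))
print(result)
```

The result is a character matrix with one row per movie, title in the first column and year in the second.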

answered Sep 23 '15 at 14:56

FelixNNelson

