
Here is an example of my df:

data
276 '83 Rally '83 (1983) (V)\t\t\t\t1983
277 '87: A Love Story (2007)\t\t\t\t2007
278 '88 Dodge Aries (2002)\t\t\t\t\t2002
279 '9': Acting Out (2009) (V)\t\t\t\t2009

I would like to create a data frame showing only the title and the year of each entry. Does anyone have advice on how to parse this? I think I may need to split the column on the tabs (\t\t\t\t). The desired output looks like this:

    Title             Year
276 '83 Rally '83     (1983)
277 '87: A Love Story (2007)
278 '88 Dodge Aries   (2002)
279 '9': Acting Out   (2009)

Here is the `dput` output:

c("# (2014)\t\t\t\t\t\t2014", "#1 (2005)\t\t\t\t\t\t2005", "#1 (2009)\t\t\t\t\t\t2009", 
"#1 (2010)\t\t\t\t\t\t2010", "#1 (2010/I) (V)\t\t\t\t\t\t2010", 
"#1 (2010/II) (V)\t\t\t\t\t2010")
Jmira2312
  • How many columns do you actually have at the moment? 1? [A `dput` would be helpful.](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) – alistaire Feb 13 '17 at 02:21
  • Actually, the example that you give does not make the structure of your data.frame obvious. Could you please provide your data in a way that shows the structure? Please use `dput(df)` and paste the result into your question. If your data is very long, it will be fine to use `dput(head(df))` – G5W Feb 13 '17 at 02:22
  • @alistaire @G5W I only have one column named `data` at the moment. It contains strings of movie information (titles, release dates). I'm not familiar with `dput`, but I ran `dput(head(df))` and will put the output in the question. – Jmira2312 Feb 13 '17 at 02:35

1 Answer


Using gsub():

df$Title <- gsub("(.*?) \\(.*", "\\1", df$data)
df$Year  <- gsub(".*\\((\\d{4})\\).*", "\\1", df$data)

> df[c("Title", "Year")]
                  Title Year
1     276 '83 Rally '83 1983
2 277 '87: A Love Story 2007
3   278 '88 Dodge Aries 2002
4   279 '9': Acting Out 2009
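
Since the question mentions splitting on the tabs, here is a rough sketch of that alternative. It assumes every string contains at least one tab and that the year is the last tab-separated field; the names `x`, `parts`, `raw_title` and `result` are only illustrative.

```r
# Sample strings copied from the dput() output in the question
x <- c("#1 (2005)\t\t\t\t\t\t2005",
       "#1 (2010/I) (V)\t\t\t\t\t\t2010",
       "#1 (2010/II) (V)\t\t\t\t\t2010")

# Split each string on a run of one or more tabs
parts <- strsplit(x, "\t+")

# First piece is the title with its "(year)" suffix, second piece is the year
raw_title <- vapply(parts, function(p) p[1], character(1))
year      <- vapply(parts, function(p) p[2], character(1))

# Drop everything from the first " (" onwards to leave the bare title
title <- sub(" \\(.*", "", raw_title)

result <- data.frame(Title = title, Year = year, stringsAsFactors = FALSE)
```

Taking the year from the field after the tabs avoids having to parse the parenthesised part at all, which may help with entries such as `(2010/I)`.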

Note: If data is actually a standalone vector, then just use it directly, e.g.

Title <- gsub("(.*?) \\(.*", "\\1", data)
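
Alternatively, a bare character vector can be wrapped in a one-column data frame first, so that `df$data` works as written. A minimal sketch, assuming the vector is called `x` (a made-up name) and holds strings like those in the `dput` output:

```r
# Sample strings copied from the dput() output in the question
x <- c("# (2014)\t\t\t\t\t\t2014",
       "#1 (2005)\t\t\t\t\t\t2005",
       "#1 (2009)\t\t\t\t\t\t2009")

# Wrap the vector in a data frame with a single column named `data`
df <- data.frame(data = x, stringsAsFactors = FALSE)

# Same two gsub() calls as above
df$Title <- gsub("(.*?) \\(.*", "\\1", df$data)
df$Year  <- gsub(".*\\((\\d{4})\\).*", "\\1", df$data)
```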

Here is an explanation of the regex used to extract the year:

.*        match any leading characters (greedy)
\\(       then a literal opening parenthesis
(\\d{4})  capture a four-digit year
\\)       followed by a literal closing parenthesis
.*        consume the remainder of the string

The backreference \\1 used as the replacement in gsub() refers to the four-digit year captured by the first group, so the whole string is replaced by just the year.
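
One caveat: the year pattern requires the closing parenthesis to follow the four digits immediately, so entries such as `(2010/I)` from the `dput` output do not match, and gsub() returns those strings unchanged. A possible relaxation, offered as a sketch rather than a tested solution for the full data set, is to allow any non-`)` characters after the digits; the names `strict` and `relaxed` are just for illustration.

```r
# Two sample strings from the question, one with a "/I" suffix after the year
x <- c("'83 Rally '83 (1983) (V)\t\t\t\t1983",
       "#1 (2010/I) (V)\t\t\t\t\t\t2010")

# Original pattern: needs ")" right after the four digits
strict <- gsub(".*\\((\\d{4})\\).*", "\\1", x)

# Relaxed pattern: [^)]* permits extra characters such as "/I" before ")"
relaxed <- gsub(".*\\((\\d{4})[^)]*\\).*", "\\1", x)

# No match means no replacement, so the second string comes back untouched
unchanged <- identical(strict[2], x[2])
```

Here `relaxed` holds only the year for both strings, while `strict` does so only for the first.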

Tim Biegeleisen
  • Thank you. When I tried your code I got this error: `Error in df$data : $ operator is invalid for atomic vectors` – Jmira2312 Feb 13 '17 at 02:40
  • It sounds like `data` isn't a data frame, it's just a vector of strings. In this case just replace `df$data` with `data` in the code snippet I gave above. – Tim Biegeleisen Feb 13 '17 at 02:42
  • Thanks, I fixed my data and it's now in a df. Do you mind explaining your regex, specifically the one used for `df$Year`? – Jmira2312 Feb 13 '17 at 02:56