I am loading the data with the code below.

imdb_movie_data <- read.csv("https://raw.githubusercontent.com/sundeepblue/movie_rating_prediction/master/movie_metadata.csv")

Now I want to remove the last character from each movie_title, so I wrote the following code:

substr(imdb_movie_data, 1, (nchar(imdb_movie_data$movie_title)-1))

But this does not remove the last character from the column. Let me know if anything needs clarification.

Jim
Akash Barnwal
  • The first parameter needs to be `imdb_movie_data$movie_title` – alistaire Dec 01 '16 at 00:41
  • I tried that, but I am still not able to remove the last character "Â". – Akash Barnwal Dec 01 '16 at 00:52
  • You need to make sure `movie_title` is a character vector. – JasonWang Dec 01 '16 at 00:53
  • This question could be improved by providing a small reproducible example. [Here are a few tips](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on how to do just that. Linking to off-site resources is not optimal as the data on the other end can go offline without notice. – Roman Luštrik Dec 01 '16 at 08:45

2 Answers

Two problems:

1) imdb_movie_data$movie_title is a factor rather than a character vector (read.csv created factors by default in R versions before 4.0.0), so it needs to be converted to character with as.character.

2) You need to assign the result back to imdb_movie_data$movie_title if you want the change to have a lasting effect:

imdb_movie_data$movie_title <- substr(as.character(imdb_movie_data$movie_title),
                       start= 1, 
                       stop= nchar(as.character(imdb_movie_data$movie_title) )-1 )

> head(imdb_movie_data$movie_title)
[1] "Avatar "                                                
[2] "Pirates of the Caribbean: At World's End "              
[3] "Spectre "                                               
[4] "The Dark Knight Rises "                                 
[5] "Star Wars: Episode VII - The Force Awakens             "
[6] "John Carter "      

In R the mere act of running a function has no effect on the arguments to the function. You need assignment back to the original vector if you want to make a change in values.
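As a minimal sketch of both points, the toy vector below stands in for the movie_title column (the values and the names `titles_f`, `titles_chr`, and `shortened` are made up for illustration). It shows that calling substr() alone leaves the input untouched and that only assignment keeps the shortened strings:

```r
# Hypothetical stand-in for imdb_movie_data$movie_title, stored as a factor.
titles_f <- factor(c("Avatar ", "Spectre "))

# Convert to character first; substr() and nchar() work on character vectors.
titles_chr <- as.character(titles_f)

# Calling substr() without assignment returns a new vector and discards it.
substr(titles_chr, 1, nchar(titles_chr) - 1)
unchanged <- titles_chr   # still has the trailing space

# Assigning the result is what makes the change persist.
shortened <- substr(titles_chr, 1, nchar(titles_chr) - 1)
print(shortened)
```

Assigning to a new name here is only to keep both versions visible; in practice the result would be written back to the original column as in the answer.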

IRTFM
  • Surprisingly, I had to run this code twice to get the output. Did you see the same? I ran it once and did not get the output, but the second time I did. – Akash Barnwal Dec 04 '16 at 05:43
  • What if I convert movie_title to character first and then call substr? My code is as follows: 1st line -> `imdb_movie_data$movie_title <- as.character(imdb_movie_data$movie_title)` 2nd line -> `imdb_movie_data$movie_title <- substr(imdb_movie_data$movie_title, start = 1, stop = nchar(imdb_movie_data$movie_title) - 1)` – Akash Barnwal Dec 04 '16 at 05:48
  • Is there a better way to do this? The whole intention is to remove the "Â" character, and I can see some titles, such as WALL·E, where a special character appears in the middle. How should I handle those? A simpler way would be to remove just the Â. – Akash Barnwal Dec 04 '16 at 05:58
  • Try getting some of this data into a .txt file to see whether the CSV import is mangling the data. I would open the file in Vim to highlight what is causing the issue. If you find a stray character there, run a regex to remove it, and then use as.character to convert the factor to character. – thenakulchawla Dec 04 '16 at 09:10
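
The regex idea from the comments could look roughly like the sketch below. It assumes the stray "Â" is how a trailing UTF-8 non-breaking space (U+00A0) shows up when the file is read under a different encoding; that is an assumption about this dataset, not something confirmed in the thread. The sample titles and the names `titles` and `clean` are hypothetical.

```r
# Hypothetical sample: titles that end in a non-breaking space (U+00A0).
titles <- c("Avatar\u00a0", "WALL\u00b7E\u00a0", "John Carter\u00a0")

# Remove non-breaking or ordinary whitespace, but only at the end of each
# string, so characters inside the title (like the middle dot) are kept.
clean <- sub("[\u00a0[:space:]]+$", "", titles)

print(clean)
```

Anchoring the pattern with `$` means nothing is removed from the middle of a title, which matters for cases like WALL·E.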

An easy way to do this would be to use regular expressions. The following commands (using the stringr package) could help:

library(stringr)
imdb_movie_data$movie_title <- str_replace_all(imdb_movie_data$movie_title, "[^A-Za-z ]", "")

You end up keeping only letters and spaces. Every other character is removed, which includes the special character but also digits and punctuation.
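
To see what a letters-and-spaces pattern does to a title, here is a small base R sketch on made-up values (the names `titles`, `pieces`, and `joined` are hypothetical). Extracting all matches yields a list with one vector of matched runs per title, so the runs have to be pasted back together to get one string per title:

```r
# Hypothetical sample titles with a trailing non-breaking space.
titles <- c("Avatar\u00a0", "WALL\u00b7E\u00a0")

# Extract every run of letters and spaces; the result is a list,
# with one character vector of matched runs per title.
pieces <- regmatches(titles, gregexpr("[A-Za-z ]+", titles))

# Collapse the runs of each title back into a single string.
joined <- vapply(pieces, paste, character(1), collapse = "")

print(joined)
```

Note the trade-off: the special character is gone, but so is the middle dot in WALL·E, so a pattern that targets only the unwanted character may be preferable.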