Split columns at delimiter [::] in MovieLens-1M data in R

Question

I am new to R programming and need to process the MovieLens-1M data. How can I split a column at the delimiter [::] in movies.dat? I have tried this code:

> moviesDF<-read.delim("movies.dat", sep="|", header=F, stringsAsFactors=FALSE)
> str(moviesDF)
'data.frame':   3998 obs. of  3 variables:
 $ V1: chr  "1::Toy Story (1995)::Animation" "2::Jumanji (1995)::Adventure" "3::Grumpier Old Men (1995)::Comedy" "4::Waiting to Exhale (1995)::Comedy" ...
 $ V2: chr  "Children's" "Children's" "Romance" "Drama" ...
 $ V3: chr  "Comedy" "Fantasy" "" "" ...

The desired output is as follows:

V1: Movie ID
V2: Title
V3: Genre

Additionally, my aim is to build a recommendation system.

Here is the start `unlist(strsplit("1::Toy Story (1995)::Animation","::"))`, also see [stringr package](http://cran.r-project.org/web/packages/stringr/vignettes/stringr.html) — zx8754, May 08 '15 at 13:33

score 1 · Accepted Answer · answered May 08 '15 at 13:40

You can try cSplit from my "splitstackshape" package. Usage would be:

library(splitstackshape)
cSplit(moviesDF, "V1", "::")
#            V2      V3 V1_1                     V1_2      V1_3
# 1: Children's  Comedy    1         Toy Story (1995) Animation
# 2: Children's Fantasy    2           Jumanji (1995) Adventure
# 3:    Romance            3  Grumpier Old Men (1995)    Comedy
# 4:      Drama            4 Waiting to Exhale (1995)    Comedy
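Note that because the file was read with sep="|", the genres in the output above are still spread over V1_3, V2 and V3. If a single genre column is wanted, the following is a minimal base-R sketch (not part of splitstackshape) of one way to recombine them. The sample data frame and the output column names MovieID, Title and Genres are assumptions made for illustration.

```r
# Hypothetical sample mimicking the result of read.delim(..., sep = "|")
moviesDF <- data.frame(
  V1 = c("1::Toy Story (1995)::Animation", "3::Grumpier Old Men (1995)::Comedy"),
  V2 = c("Children's", "Romance"),
  V3 = c("Comedy", ""),
  stringsAsFactors = FALSE
)

# Split V1 on the literal "::" and bind the pieces into a 3-column matrix
parts <- do.call(rbind, strsplit(moviesDF$V1, "::", fixed = TRUE))

# Gather the genre fragments row by row, dropping the empty padding cells
genre_cols <- cbind(parts[, 3], moviesDF$V2, moviesDF$V3)
genres <- apply(genre_cols, 1, function(g) paste(g[nzchar(g)], collapse = "|"))

movies <- data.frame(
  MovieID = as.integer(parts[, 1]),
  Title = parts[, 2],
  Genres = genres,
  stringsAsFactors = FALSE
)
```

This assumes missing genre cells were read as empty strings, as in the str() output in the question; NA cells would need separate handling.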

scoa · Answer 2 · 2015-05-08T14:02:04.027

1

The problem is in the import function. read.delim(sep="|") does not read the dataset properly because | only separates the different genre values you want in V3; the fields themselves are separated by ::. You should import your dataset with readLines instead and split each line on :: yourself:

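# Read the raw lines; each one has the form "MovieID::Title::Genres"
moviesDF <- readLines("movies.dat")
# Split every line on "::" and stack the pieces into a three-column data frame
moviesDF <- as.data.frame(do.call("rbind", strsplit(moviesDF, "::")), stringsAsFactors = FALSE)
names(moviesDF) <- c("V1", "V2", "V3")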
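As a sketch of how this approach might be extended, the example below runs the same split on two in-memory sample lines instead of the file, checks the field count, converts the ID to integer, and splits the genres on |. The sample lines, the column names MovieID, Title and Genres, and the genre_list variable are assumptions made for illustration, not part of the dataset's documentation.

```r
# Hypothetical sample lines in the movies.dat layout: MovieID::Title::Genres
lines <- c(
  "1::Toy Story (1995)::Animation|Children's|Comedy",
  "2::Jumanji (1995)::Adventure|Children's|Fantasy"
)

# fixed = TRUE treats "::" as a literal string rather than a regular expression
fields <- strsplit(lines, "::", fixed = TRUE)

# Guard: rbind would recycle values if a line had a different number of fields
stopifnot(all(lengths(fields) == 3))

movies <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(movies) <- c("MovieID", "Title", "Genres")
movies$MovieID <- as.integer(movies$MovieID)

# Optional: one character vector of genres per movie
genre_list <- strsplit(movies$Genres, "|", fixed = TRUE)
```

With the real file, replacing the sample vector with readLines("movies.dat") should give the same structure. Using fixed = TRUE matters for the genre split in particular, since | is the alternation operator in a regular expression.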

edited May 08 '15 at 14:02

answered May 08 '15 at 13:43
