
I am new to R programming and need to process the MovieLens 1M data. How can I split a column at the delimiter :: in movies.dat? I have tried this code:

> moviesDF<-read.delim("movies.dat", sep="|", header=F, stringsAsFactors=FALSE)
> str(moviesDF)
'data.frame':   3998 obs. of  3 variables:
 $ V1: chr  "1::Toy Story (1995)::Animation" "2::Jumanji (1995)::Adventure" "3::Grumpier Old Men (1995)::Comedy" "4::Waiting to Exhale (1995)::Comedy" ...
 $ V2: chr  "Children's" "Children's" "Romance" "Drama" ...
 $ V3: chr  "Comedy" "Fantasy" "" "" ...

The desired output is as follows:

V1: Movie ID
V2: Title
V3: Genre

Additionally, my goal is to build a recommendation system.

markov zain
  • Here is a start: `unlist(strsplit("1::Toy Story (1995)::Animation","::"))`. Also see the [stringr package](http://cran.r-project.org/web/packages/stringr/vignettes/stringr.html). – zx8754 May 08 '15 at 13:33
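
To illustrate the comment's suggestion on more than one line at a time, here is a minimal sketch. The two sample strings are an assumption: they rebuild what the raw lines of movies.dat probably look like, based on the str() output in the question, with the genres rejoined by |.

```r
# Sample lines reconstructed from the str() output in the question
# (assumed raw layout of movies.dat: ID::Title::Genre1|Genre2|...)
lines <- c("1::Toy Story (1995)::Animation|Children's|Comedy",
           "2::Jumanji (1995)::Adventure|Children's|Fantasy")

# fixed = TRUE treats "::" as a literal string rather than a regular expression
parts <- strsplit(lines, "::", fixed = TRUE)

# strsplit() returns a list with one character vector per input line,
# so pick the second field of each element to get the titles
titles <- vapply(parts, function(p) p[2], character(1))
print(titles)
```

Applying strsplit() to the whole vector avoids the unlist() call, which would otherwise flatten all fields of all lines into a single vector.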

2 Answers


You can try cSplit from my "splitstackshape" package. Usage would be:

library(splitstackshape)
cSplit(moviesDF, "V1", "::")
#            V2      V3 V1_1                     V1_2      V1_3
# 1: Children's  Comedy    1         Toy Story (1995) Animation
# 2: Children's Fantasy    2           Jumanji (1995) Adventure
# 3:    Romance            3  Grumpier Old Men (1995)    Comedy
# 4:      Drama            4 Waiting to Exhale (1995)    Comedy
A5C1D2H2I1M1N2O1R2T1

The problem is in the import function. read.delim(sep="|") does not read the dataset properly because | only separates the different genre values you want in V3. You should import your dataset with readLines instead:

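# Read each line of the file as one string
moviesDF <- readLines("movies.dat")
# Split every line at "::" and bind the pieces row-wise into a data frame
moviesDF <- as.data.frame(do.call("rbind",strsplit(moviesDF,"::")),stringsAsFactors = FALSE)
names(moviesDF) <- c("V1","V2","V3")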
scoa
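
As a self-contained sketch of the readLines approach, the example below writes three sample lines to a temporary file and parses them. The sample lines, the temporary file, and the column names MovieID, Title, Genres and GenreList are assumptions made for illustration; the real movies.dat may need extra handling (for example, a specific file encoding).

```r
# Hypothetical sample standing in for movies.dat (layout inferred from the question)
sample_lines <- c(
  "1::Toy Story (1995)::Animation|Children's|Comedy",
  "2::Jumanji (1995)::Adventure|Children's|Fantasy",
  "3::Grumpier Old Men (1995)::Comedy|Romance"
)
tmp <- tempfile(fileext = ".dat")
writeLines(sample_lines, tmp)

# Read raw lines and split each one at the literal "::" delimiter
raw <- readLines(tmp)
fields <- strsplit(raw, "::", fixed = TRUE)

# Guard: rbind recycles shorter vectors, so a malformed line would yield a wrong row
stopifnot(all(lengths(fields) == 3))

movies <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(movies) <- c("MovieID", "Title", "Genres")
movies$MovieID <- as.integer(movies$MovieID)

# Split the genre string at the literal "|" into a list column,
# one character vector of genres per movie
movies$GenreList <- strsplit(movies$Genres, "|", fixed = TRUE)

str(movies)
```

Keeping the genres as a list column is one possible design choice; for a recommendation system, a one-column-per-genre indicator layout may be more convenient later on.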