I have a dataframe
that contains all of my data, some 26k entries. From this, I have created three subsets; test, training and validation data sets. Each of these sets have corresponding corpuses
and document term matrices
produced.
A sample of the raw data matrix is below. As you can see, the unique ID is column 4.
An example of one of the document term matrices is below. The row names obviously correspond to the ID column, tweet_ID, in the raw data above.
What I'm trying to do is to use the unique IDs, which are the row names in the document term matrices, to refer back to the raw data matrix to extract the label. The label is the 'stance' column, column 10, in the raw data.
I then want to append the label as an additional column to my document term matrices, either before or after the columns for the terms.
I've tried using merge
and grepl
but I encountered various errors to do with row name issues. I looked at this question but the solution there, using merge
, didn't work.
My code is working fine up to this point, but I've included it below in case it's of use to anyone.
Main script
#Imports
getImports <- function(){
library(jsonlite)
library(tm)
library(fpc)
library(stringr)
library(SnowballC)
}
#Main Program
run <- function(){
#Includes
source("./Classifier_Functions.R")
#Get the raw data
rawData <- data.frame(getInput())
rawData$tweet <- sapply(rawData$tweet,function(row) iconv(row, "latin1", "ASCII", sub=""))
# Set custom reader
aReader <- readTabular(mapping=list(content="tweet", id="tweet_ID"))
#Set the data specification to 70% training, 20% test and 10% validate
spec = c(train = .7, test = .2, validate = .1)
#Set the seed for reproducable results when using random generators
set.seed(1)
#Sample the raw data
g=sample(cut(
seq(nrow(rawData)),
nrow(rawData)*cumsum(c(0,spec)),
labels = names(spec)
))
#Split the raw data into the three groups specified
groups = split(rawData, g)
#Generate clean corpuses for all data sets
testCorpus <- createCorpus(groups$test,aReader)
trainCorpus <- createCorpus(groups$train,aReader)
validateCorpus <- createCorpus(groups$validate,aReader)
#Produce 'bag of words' Document Term Matrices for test, training and validation data
testDTM <- getDocumentTermMatrix(testCorpus)
trainDTM <- getDocumentTermMatrix(trainCorpus)
validateDTM <- getDocumentTermMatrix(validateCorpus)
#Need to extract labels from raw data and append to document term matrices here.
}
Functions file
#Get raw input data from JSON file
getInput <- function() {
#Import data
json_file <-"path_to_file"
#Set data to dataframe
frame <- fromJSON(json_file)
cat("\nImported data.")
return(frame)
}
#Produces a corpus
createCorpus <- function(data, r){
corpus <- VCorpus(DataframeSource(data), readerControl = list(reader = r))
corpus <- tm_map(corpus, content_transformer(tolower)) cat("\nTransformed to lowercase.")
corpus <- tm_map(corpus, stripWhitespace) cat("\nWhitespace stripped.")
corpus <- tm_map(corpus, removePunctuation) cat("\nPunctuation removed.")
corpus <- tm_map(corpus, removeNumbers) cat("\nNumbers removed.")
corpus <- tm_map(corpus, removeWords, stopwords('english')) cat("\nStopwords removed.")
corpus <- tm_map(corpus, stemDocument) cat("\nStemming complete.")
cat("\nCorpus creattion complete.") return(corpus) }
#Generate Document Term Matrix
getDocumentTermMatrix<-function(corpus){
dtm <- DocumentTermMatrix(corpus)
dtm<-removeSparseTerms(dtm, 0.9)
cat("\nDTM produced.")
return(dtm)
}
#Generate tf-idf Document Term Matrix
getTfIdfMatrix<-function(corpus){
tfIdf <-DocumentTermMatrix(corpus, control=list(weighting = weightTfIdf()))
cat("\nTf-Idf computation complete.")
return(tfIdf)
}
EDIT: As suggested in the comments, the dput
output for recreating the raw data and one example DTM can be found in this Google Drive folder. The dput()
output is far too large for me to share in plain text here.
EDIT 2: Attempts at using merge
include
merge(rawData[c(10)], data.frame(as.matrix(testDTM)), by="tweet_ID", all.x=TRUE)
and
merge(rawData[c(10)], data.frame(as.matrix(testDTM)), by.x="tweet_ID", by.y="row.names")
EDIT 3: Added dput()
in plain text for the first 5 rows of the raw data.
Raw data
structure(list(numRetweets = c(1L, 339L, 1L, 179L, 0L), numFavorites = c(2L, 178L, 2L, 152L, 0L), username = c("iainastewart", "DavidNuttallMP", "DavidNuttallMP", "DavidNuttallMP", "DavidNuttallMP"), tweet_ID = c("745870298600316929", "740663385214324737", "741306107059130368", "742477469983363076", "743146889596534785"), tweet_length = c(140L, 118L, 140L, 139L, 63L), tweet = c("RT @carolemills77: Many thanks to all the @mkcouncil #EUref staff who are already in the polling stations ready to open at 7am and the Elec", "RT @BetterOffOut: If you agree with @DanHannanMEP, please RT. #VoteLeave #Brexit #BetterOffOut ", "RT @iaingartside: Out with @DavidNuttallMP @DeneVernon @CllrSueNuttall Campaigning to \"Leave\" in the EU ref in Bury Today #Brexit https://t", "RT @simplysimontfa: Don't be distracted by good opinion polls 4 Leave; the only way to get our country back is to maximise the Brexit vote", "@GrumpyPete Just a little light relief #BetterOffOut #VoteLeave" ), number_hashtags = c(1L, 3L, 1L, 1L, 2L), number_URLs = c(0L, 0L, 0L, 0L, 0L), sentiment_score = c(2L, 2L, -1L, 0L, 0L), stance = c("leave", "leave", "leave", "leave", "leave")), .Names = c("numRetweets", "numFavorites", "username", "tweet_ID", "tweet_length", "tweet", "number_hashtags", "number_URLs", "sentiment_score", "stance" ), row.names = c(NA, 5L), class = "data.frame")
Document term matrix