0

I have a dataframe that contains all of my data, some 26k entries. From this, I have created three subsets; test, training and validation data sets. Each of these sets have corresponding corpuses and document term matrices produced.

A sample of the raw data matrix is below. As you can see, the unique ID is column 4.

Raw data table

An example of one of the document term matrices is below. The row names obviously correspond to the ID column, tweet_ID, in the raw data above.

Document term matrix

What I'm trying to do is to use the unique IDs, which are the row names in the document term matrices, to refer back to the raw data matrix to extract the label. The label is the 'stance' column, column 10, in the raw data.

I then want to append the label as an additional column to my document term matrices, either before or after the columns for the terms.

I've tried using merge and grepl but I encountered various errors to do with row name issues. I looked at this question but the solution there, using merge, didn't work.

My code is working fine up to this point, but I've included it below in case it's of use to anyone.

Main script

#Imports
getImports <- function(){
    library(jsonlite)
    library(tm)
    library(fpc)
    library(stringr)
    library(SnowballC)
}

#Main Program
run <- function(){
    #Includes
  source("./Classifier_Functions.R")

  #Get the raw data
  rawData <- data.frame(getInput())

  rawData$tweet <- sapply(rawData$tweet,function(row) iconv(row, "latin1", "ASCII", sub=""))

  # Set custom reader
  aReader <- readTabular(mapping=list(content="tweet", id="tweet_ID"))

  #Set the data specification to 70% training, 20% test and 10% validate
  spec = c(train = .7, test = .2, validate = .1)

  #Set the seed for reproducable results when using random generators
  set.seed(1)

  #Sample the raw data
  g=sample(cut(
    seq(nrow(rawData)), 
    nrow(rawData)*cumsum(c(0,spec)), 
    labels = names(spec)
  ))

  #Split the raw data into the three groups specified
  groups = split(rawData, g)

  #Generate clean corpuses for all data sets
  testCorpus <- createCorpus(groups$test,aReader)
  trainCorpus <- createCorpus(groups$train,aReader)
  validateCorpus <- createCorpus(groups$validate,aReader)

  #Produce 'bag of words' Document Term Matrices for test, training and validation data
  testDTM <- getDocumentTermMatrix(testCorpus)
  trainDTM <- getDocumentTermMatrix(trainCorpus)
  validateDTM <- getDocumentTermMatrix(validateCorpus)

  #Need to extract labels from raw data and append to document term matrices here.
}

Functions file

#Get raw input data from JSON file
getInput <- function() {   
   #Import data   
   json_file <-"path_to_file"
   #Set data to dataframe   
   frame <- fromJSON(json_file)   
   cat("\nImported data.")   
   return(frame) 
}

#Produces a corpus 
createCorpus <- function(data, r){   
    corpus <- VCorpus(DataframeSource(data), readerControl = list(reader = r))
    corpus <- tm_map(corpus, content_transformer(tolower))   cat("\nTransformed to lowercase.")
    corpus <- tm_map(corpus, stripWhitespace)   cat("\nWhitespace stripped.")
    corpus <- tm_map(corpus, removePunctuation)   cat("\nPunctuation removed.")
    corpus <- tm_map(corpus, removeNumbers)   cat("\nNumbers removed.")
    corpus <- tm_map(corpus, removeWords, stopwords('english'))   cat("\nStopwords removed.")
    corpus <- tm_map(corpus, stemDocument)   cat("\nStemming complete.")
    cat("\nCorpus creattion complete.")   return(corpus) }

#Generate Document Term Matrix 
getDocumentTermMatrix<-function(corpus){   
    dtm <- DocumentTermMatrix(corpus)   
    dtm<-removeSparseTerms(dtm, 0.9)   
    cat("\nDTM produced.")   
    return(dtm) 
}

#Generate tf-idf Document Term Matrix 
getTfIdfMatrix<-function(corpus){   
    tfIdf <-DocumentTermMatrix(corpus, control=list(weighting = weightTfIdf()))   
    cat("\nTf-Idf computation complete.")   
    return(tfIdf) 
}

EDIT: As suggested in the comments, the dput output for recreating the raw data and one example DTM can be found in this Google Drive folder. The dput() output is far too large for me to share in plain text here.

EDIT 2: Attempts at using merge include

merge(rawData[c(10)], data.frame(as.matrix(testDTM)), by="tweet_ID", all.x=TRUE) 

and

merge(rawData[c(10)], data.frame(as.matrix(testDTM)), by.x="tweet_ID", by.y="row.names")

EDIT 3: Added dput() in plain text for the first 5 rows of the raw data.

Raw data

structure(list(numRetweets = c(1L, 339L, 1L, 179L, 0L), numFavorites = c(2L,  178L, 2L, 152L, 0L), username = c("iainastewart", "DavidNuttallMP",  "DavidNuttallMP", "DavidNuttallMP", "DavidNuttallMP"), tweet_ID = c("745870298600316929",  "740663385214324737", "741306107059130368", "742477469983363076",  "743146889596534785"), tweet_length = c(140L, 118L, 140L, 139L,  63L), tweet = c("RT @carolemills77: Many thanks to all the @mkcouncil #EUref staff who are already in the polling stations ready to open at 7am and the Elec",  "RT @BetterOffOut: If you agree with @DanHannanMEP, please RT. #VoteLeave #Brexit #BetterOffOut ",  "RT @iaingartside: Out with @DavidNuttallMP @DeneVernon @CllrSueNuttall Campaigning to \"Leave\" in the EU ref in Bury Today #Brexit https://t",  "RT @simplysimontfa: Don't be distracted by good opinion polls 4 Leave; the only way to get our country back is to maximise the Brexit vote",  "@GrumpyPete Just a little light relief #BetterOffOut #VoteLeave" ), number_hashtags = c(1L, 3L, 1L, 1L, 2L), number_URLs = c(0L,  0L, 0L, 0L, 0L), sentiment_score = c(2L, 2L, -1L, 0L, 0L), stance = c("leave",  "leave", "leave", "leave", "leave")), .Names = c("numRetweets",  "numFavorites", "username", "tweet_ID", "tweet_length", "tweet",  "number_hashtags", "number_URLs", "sentiment_score", "stance" ), row.names = c(NA, 5L), class = "data.frame")

Document term matrix

Scott
  • 1,863
  • 2
  • 24
  • 43
  • 1
    instead of overloading the question with unnecessary code and pictures of the matrices, please edit the question and only include input and output example data using `dput()`. – Roman Sep 28 '17 at 13:24
  • Try `dput(head(df,10))` for the 10 first rows like in your sreenshot. Otherwise it's hard to help you as a simple `merge`, `left_join` approach seems to be straightforward. Please include your attempts with `merge` as well. – Roman Sep 28 '17 at 13:42
  • I've put the full `dput()` output into text files and linked a shared Google Drive folder – Scott Sep 28 '17 at 13:46
  • As google drive is blocked at my computer I can't help you. Plase read https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Roman Sep 28 '17 at 13:50
  • Added `dput()` output in plain text for the first 5 raw data entries. – Scott Sep 28 '17 at 13:55
  • nice, but useless as it is incomplete... – Roman Sep 28 '17 at 14:49

0 Answers0