EDIT: If there's a better way to ask this question or anything I should articulate to facilitate answers, please let me know. Thanks!
I'm trying to integrate one-hot encoding into my R code so I can do text mining on .txt files, but I'm running into errors. If I perform the one-hot encoding first and then clean the data with a bag-of-words approach, I get an error saying the term-document matrix (TDM) can't be coerced to a data frame, so I can't remove sparsity. If I clean the data first and then add the one-hot encoding, I get the same TDM-to-data-frame error. Do you know how I can get around this? I'm pretty new to R, so I can use all the help I can get.
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ‘c("TermDocumentMatrix", "simple_triplet_matrix")’ to a data.frame
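From the error I'm guessing the TDM is stored as a sparse "simple_triplet_matrix" rather than a regular matrix, and that it might need to go through as.matrix() before it can become a data frame. The toy snippet below (not my real data) reproduces the error for me; the last line is my untested guess at a workaround, and I don't know whether it's the right approach or whether it defeats the purpose of removing sparsity first:
library(tm)
toy_corpus <- VCorpus(VectorSource(c("first toy document", "second toy document")))
toy_tdm <- TermDocumentMatrix(toy_corpus)
# data.frame(toy_tdm)                        # <- this line reproduces the coercion error
toy_df <- as.data.frame(as.matrix(toy_tdm))  # <- my guess: densify first, then make a data frame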
Here's a reproducible example of my full workflow (thanks @william3031):
library(tm)
library(tidyverse)
library(visNetwork)
library(broom)
library(SnowballC)
library(tidytext)
library(caret) # for dummyVars()
path = box::file()
main_dir = (dirname(path))
data_dir = paste0(main_dir, "/data")
plot_dir = paste0(main_dir, "/plots")
#Import data
CNN_HDW_data <- read.delim(file.choose(), #file.choose() enables file browsing
header = FALSE, # Because no header in data
sep = '') # sep = '' splits fields on any whitespace
view(CNN_HDW_data) # success!
head(CNN_HDW_data)
tail(CNN_HDW_data)
# Make the data a source.
CNN_HDW_source <- VectorSource(CNN_HDW_data)
# Make a volatile corpus
CNN_HDW_corpus <- VCorpus(CNN_HDW_source)
# Print out the corpus
CNN_HDW_corpus
# Convert `tolower` to `tm`-compatible function.
TmTolower = content_transformer(tolower)
# Convert all characters in `CNN_HDW_corpus` to lower case.
CNN_HDW_clean = tm_map(CNN_HDW_corpus, #<- corpus object to be cleaned
TmTolower)
CNN_HDW_clean[[15]][1]
head(CNN_HDW_clean)
# Remove punctuation from all documents in corpus.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
removePunctuation)
CNN_HDW_clean[[15]][1]
# Remove numbers from all documents in corpus.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
removeNumbers)
# Replace everything except letters with a space.
SubstitutePattern = content_transformer(
function(document, pattern, replacement){ #<- notice the 1st argument has to be the document
gsub(pattern,
replacement,
document)
}
)
CNN_HDW_clean = tm_map(CNN_HDW_clean,
SubstitutePattern,
"[^A-z]", #<- provide pattern to replace
" ") #<- provide replacement
# Remove the string "mathbf" from all documents in corpus.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
SubstitutePattern,
"mathbf",
" ")
# Remove one- and two-letter words from all documents in corpus.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
SubstitutePattern,
"\\s[A-z]{1,2}\\s",
" ")
CNN_HDW_clean[[15]][1]
# Define remove words function that takes a single argument.
RemoveEnglishWords = function(document){ #<- take a single document argument
removeWords(document, #<- remove words in the document
stopwords("english")) #<- give function a vector of common English stopwords
}
# Apply this transformation to the entire corpus.
CNN_HDW_clean = tm_map(CNN_HDW_clean, #<- set corpus
RemoveEnglishWords) #<- set transformation
CNN_HDW_clean[[15]][1]
# Stem documents in corpus.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
stemDocument)
# Substitute all leading and trailing whitespace with an empty string.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
SubstitutePattern,
"^\\s+|\\s+$",
"")
# Collapse runs of whitespace between words into a single space.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
stripWhitespace)
# Construct a term document matrix.
CNN_HDW_TDM = TermDocumentMatrix(CNN_HDW_clean)
CNN_HDW_TDM
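# Note: printing the TDM shows it has class "TermDocumentMatrix"/"simple_triplet_matrix",
# which I assume is the sparse object the coercion error is complaining about.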
# Remove sparse terms from a TDM.
CNN_HDW_TDM = removeSparseTerms(CNN_HDW_TDM,
sparse = 0.75)
CNN_HDW_TDM
CNN_HDW_TDM[[1]][1]
rownames(CNN_HDW_TDM)
colnames(CNN_HDW_TDM)
#create data frame
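# NOTE: the next line is where the "cannot coerce class ... to a data.frame" error is thrown for me.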
DF_CNN_HDW3 <- data.frame(CNN_HDW_TDM)
DF_CNN_HDW3
# Create the one-hot encoder object with caret's dummyVars().
dummy_CNN_HDW3 <- dummyVars(" ~ .",
data=DF_CNN_HDW3)
#perform one-hot encoding on data frame
final_DF_CNN_HDW3 <- data.frame(predict(dummy_CNN_HDW3,
newdata=DF_CNN_HDW3))
#view final data frame
view(final_DF_CNN_HDW3)