How to use bag-of-words with one-hot encoding on txt file?

Question

EDIT: If there's a better way to ask this question or anything I should articulate to facilitate answers, please let me know. Thanks!

I'm trying to integrate one-hot encoding into my R code so I can do text mining on txt files, but I keep running into the error below. If I perform one-hot encoding first and then clean the data with bag-of-words, I get an error saying the TDM (term-document matrix) can't be coerced to a data frame, so I can't remove sparse terms. If I clean the data first and then add one-hot encoding, I get the same TDM-to-data-frame error. How can I get around this? I'm pretty new to R, so I can use all the help I can get.

Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : 
  cannot coerce class ‘c("TermDocumentMatrix", "simple_triplet_matrix")’ to a data.frame
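
For context, this message typically appears when `data.frame()` is called directly on a `TermDocumentMatrix`, because `tm` stores it as a sparse `simple_triplet_matrix` and base R has no data-frame method for that class. A minimal sketch of one possible workaround, assuming a small made-up corpus (the `docs` vector and all object names here are hypothetical, not taken from the question's data), is to convert to a dense matrix with `as.matrix()` first:

```r
library(tm)

# Hypothetical toy documents standing in for the contents of the txt file.
docs <- c("the cat sat on the mat", "the dog sat", "cat and dog")

corpus <- VCorpus(VectorSource(docs))
tdm <- TermDocumentMatrix(corpus)

# data.frame(tdm) would fail here, because tdm is a sparse triplet matrix.
# Converting to a dense base-R matrix first avoids the coercion error.
term_matrix <- as.matrix(tdm)   # rows = terms, columns = documents

# Transpose so that each row is a document and each column is a term.
doc_term_df <- as.data.frame(t(term_matrix))

# Presence/absence (one-hot style) version: 1 if the term occurs, else 0.
one_hot_df <- as.data.frame((t(term_matrix) > 0) * 1L)
```

Note that `as.matrix()` builds a dense matrix, so on a large corpus it may be worth calling `removeSparseTerms()` on the TDM before converting.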

Here's a reproducible example (thanks @william3031):

library(tm)
library(tidyverse)
library(visNetwork)
library(broom)
library(SnowballC)
library(tidytext)
library(caret) # provides dummyVars(), used at the end of the script

path = box::file()
main_dir = (dirname(path))
data_dir = paste0(main_dir, "/data")
plot_dir = paste0(main_dir, "/plots")


#Import data
CNN_HDW_data <- read.delim(file.choose(), #file.choose() enables file browsing
                   header = FALSE, # Because no header in data
                   sep = '') # Different vectors indicated by spaces

view(CNN_HDW_data) # success!
head(CNN_HDW_data)
tail(CNN_HDW_data)

CNN_HDW_data
tail(CNN_HDW_data)

# Make the data a source
CNN_HDW_source <- VectorSource(CNN_HDW_data)

# Make a volatile corpus
CNN_HDW_corpus <- VCorpus(CNN_HDW_source)
# Print out the corpus
CNN_HDW_corpus

# Convert `tolower` to `tm`-compatible function.
TmTolower = content_transformer(tolower)

# Convert all characters in `CNN_HDW_corpus` to lower case.
CNN_HDW_clean = tm_map(CNN_HDW_corpus, #<- corpus object to be cleaned
                           TmTolower)
CNN_HDW_clean[[15]][1]

head(CNN_HDW_clean)

# Remove punctuation from all documents in corpus.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
                           removePunctuation)

CNN_HDW_clean[[15]][1]

# Remove numbers from all documents in corpus.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
                           removeNumbers)

# Replace everything BUT letters with a space.
SubstitutePattern = content_transformer(
  function(document, pattern, replacement){ #<- notice the 1st argument has to be the document
    gsub(pattern, 
         replacement,       
         document)
  }
)

CNN_HDW_clean = tm_map(CNN_HDW_clean,
                           SubstitutePattern,
                           "[^A-Za-z]", #<- provide pattern to replace ("A-z" would also match [ \ ] ^ _ `)
                           " ")         #<- provide replacement

# Remove the leftover token "mathbf" from all documents in corpus.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
                           SubstitutePattern,
                           "mathbf",
                           " ")

# Remove one- and two-letter words from all documents in corpus.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
                           SubstitutePattern,
                           "\\s[A-Za-z]{1,2}\\s",
                           " ")

CNN_HDW_clean[[15]][1]

help(keep)

# Define remove words function that takes a single argument.
RemoveEnglishWords = function(document){ #<- take a single document argument
  removeWords(document,               #<- remove words in the document
              stopwords("english"))   #<- give function a vector of common English stopwords
}

# Apply this transformation to the entire corpus.
CNN_HDW_clean = tm_map(CNN_HDW_clean,  #<- set corpus
                           RemoveEnglishWords) #<- set transformation

CNN_HDW_clean[[15]][1]

# Stem documents in corpus.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
                           stemDocument)

# Substitute all leading and trailing whitespace with an empty string.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
                           SubstitutePattern,
                           "^\\s+|\\s+$",
                           "")

# Collapse runs of whitespace between words into a single space.
CNN_HDW_clean = tm_map(CNN_HDW_clean,
                           stripWhitespace)

# Construct a term document matrix.
CNN_HDW_TDM = TermDocumentMatrix(CNN_HDW_clean)
CNN_HDW_TDM

# Remove sparse terms from a TDM.
CNN_HDW_TDM = removeSparseTerms(CNN_HDW_TDM,     
                             sparse = 0.75) 
CNN_HDW_TDM

CNN_HDW_TDM[[1]][1]

rownames(CNN_HDW_TDM)
colnames(CNN_HDW_TDM)

#create data frame (this call raises the coercion error shown above)
DF_CNN_HDW3 <- data.frame(CNN_HDW_TDM)
DF_CNN_HDW3


#define one-hot encoding function
dummy_CNN_HDW3 <- dummyVars(" ~ .", 
                            data=DF_CNN_HDW3)

#perform one-hot encoding on data frame
final_DF_CNN_HDW3 <- data.frame(predict(dummy_CNN_HDW3, 
                                        newdata=DF_CNN_HDW3))

#view final data frame
view(final_DF_CNN_HDW3)
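
As a point of comparison, "one-hot encoding" of a bag-of-words is often taken to mean a presence/absence matrix: one row per document, one column per term, and a 1 wherever the term occurs. As far as I understand, `dummyVars()` from `caret` expands factor columns into indicator columns and passes numeric columns through unchanged, so on a table of term counts it may not change anything. The following base-R sketch illustrates the presence/absence idea under stated assumptions: the `docs` vector and all object names are hypothetical, and the text is assumed to be already cleaned.

```r
# Hypothetical toy documents; in the question these would come from the txt file.
docs <- c(doc1 = "the cat sat on the mat",
          doc2 = "the dog sat",
          doc3 = "cat and dog")

# Tokenise on whitespace (assumes lowercasing and cleaning are already done).
tokens <- strsplit(docs, "\\s+")

# Vocabulary: every distinct token, sorted for a stable column order.
vocab <- sort(unique(unlist(tokens)))

# Bag-of-words counts: one row per document, one column per term.
counts <- t(vapply(tokens,
                   function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
                   integer(length(vocab))))
colnames(counts) <- vocab

# One-hot style presence/absence: 1 if the term occurs in the document, else 0.
presence <- (counts > 0) * 1L

one_hot_df <- as.data.frame(presence)
```

The same `(x > 0) * 1L` step can be applied to a dense matrix obtained from a `tm` term-document matrix, which keeps the cleaning pipeline and the encoding step separate.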

Hi, can you provide a reproducible example? https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — william3031, Mar 17 '23 at 00:35
