I'm analyzing a data frame of product reviews that contains some empty entries and some text written in foreign languages. The data also contain customer attributes that can be used as "features" in later analysis.
To begin with, I will convert the reviews column into a DocumentTermMatrix and then convert that into lda format. I then plan to pass the documents and vocab objects generated in the lda step, along with selected columns of the original data frame, into stm's prepDocuments() function, so that I can leverage the more versatile estimation functions of that package, using customer attributes as features to predict topic salience.
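A minimal sketch of what I have in mind (assuming the tm, topicmodels, and stm packages; the pre-processing options in control are just placeholders, not my final choices):

library(tm)
library(topicmodels)   # provides dtm2ldaformat()
library(stm)

# Build a document-term matrix from the reviews column
corpus <- VCorpus(VectorSource(df$reviews))
dtm <- DocumentTermMatrix(corpus,
                          control = list(removePunctuation = TRUE,
                                         removeNumbers     = TRUE,
                                         stopwords         = TRUE))

# Convert the DTM to the lda package's format (a list with $documents and $vocab);
# omit_empty = FALSE keeps a slot for every review, including the empty ones
lda_fmt <- dtm2ldaformat(dtm, omit_empty = FALSE)

# Feed documents, vocab, and selected customer attributes into stm;
# this is the step where the empty documents described below cause trouble
out <- prepDocuments(documents = lda_fmt$documents,
                     vocab     = lda_fmt$vocab,
                     meta      = df[, c("rating", "source")])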
However, some of the empty cells, punctuation, and foreign characters may be removed during pre-processing, which creates character(0) rows in the lda documents object and leaves those reviews unable to be matched to their corresponding rows in the original data frame. Eventually, this prevents me from generating the desired stm object with prepDocuments().
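Here is roughly how I check which reviews end up empty after pre-processing (a sketch building on dtm and lda_fmt from above; the is.null()/length() test is just meant to cover however the empty slots come out):

# Flag lda-format documents that lost all of their terms in pre-processing
# (covers both NULL entries and zero-length / character(0)-style slots)
empty <- sapply(lda_fmt$documents, function(d) is.null(d) || length(d) == 0)

# The same information taken straight from the DTM (slam is installed with tm)
empty_dtm <- slam::row_sums(dtm) == 0

df$reviews[empty]   # the original reviews that end up as empty documents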
Methods to remove empty documents certainly exist (such as those recommended in this previous thread), but I am wondering whether there is also a way to remove the rows corresponding to the empty documents from the original data frame, so that the number of lda documents and the row dimension of the data frame that will be used as meta in the stm functions stay aligned. Will indexing help?
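What I have in mind is something like the sketch below, reusing the empty flag from above, but I am not sure whether this kind of logical indexing is the right way to keep the documents and the meta data frame in sync:

# Use one logical index for both objects so they stay aligned
keep      <- !empty
documents <- lda_fmt$documents[keep]
meta      <- df[keep, c("rating", "source"), drop = FALSE]

out <- prepDocuments(documents = documents,
                     vocab     = lda_fmt$vocab,
                     meta      = meta)

# Sanity check: documents and meta rows should still line up afterwards
length(out$documents) == nrow(out$meta)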
Part of my data is listed below.
df = data.frame(reviews = c("buenisimoooooo", "excelente", "excelent",
"awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone",
"phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase",
"//:", "//:", "phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend", "1111111", "great bang buck", "actually happy little sister really first good great picture late",
"good phone good reception home fringe area screen lovely just right size good buy", "@#haha", "phone verizon contract phone buyer beware", "这东西太棒了",
"excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund",
"good phone price fine", "phone star battery little soon yes"),
rating = c(4, 4, 4, 4, 4, 3, 2, 4, 1, 4, 3, 1, 4, 3, 1, 2, 4, 4, 1, 1),
source = c("amazon", "bestbuy", "amazon", "newegg", "amazon",
"amazon", "zappos", "newegg", "amazon", "amazon",
"amazon", "amazon", "amazon", "zappos", "amazon",
"amazon", "newegg", "amazon", "amazon", "amazon"))