I am working on a dataframe that consists of a collection of social media posts. After parsing, stemming, and cleaning the text column of that dataframe, I want to map the output (mylist, which is a list of character vectors) back to the original metadata (mydf) so that I can remove the rows of mydf whose parsed/cleaned text has zero length (i.e., character(0)).
I have looked at some previous posts (1, 2), but my data contain several foreign-language posts (e.g., row 6) whose text is segmented differently and returned as a vector of several strings rather than a single string. As a result, the approach recommended in 1 didn't work, because R could not determine where the Chinese sentence ends.
Part of my data is provided below. I would highly appreciate it if someone could shed light on this.
# part of the data
mydf <- data.frame(
  document = c("I want an apple", "//:", "This is a dog", "Suppose that...", "@%!!", "半夜快笑死"),
  id = c(1, 2, 3, 4, 5, 6),
  gender = c("M", "F", "M", "M", "F", "?"),
  source = c("Facebook", "Facebook", "Twitter", "Facebook", "Twitter", "Weibo")
)
# the parsed/stemmed text output
mylist <- list()
mylist[1] = "i want an apple"
mylist[2] = list(character(0))
mylist[3] = "this is a dog"
mylist[4] = "suppose that"
mylist[5] = list(character(0))
mylist[6] = list(c("半夜", "快", "笑死"))
mylist
# I want to delete the rows of mydf whose corresponding element in mylist has zero length
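
To make the goal concrete, here is a minimal sketch of the kind of filtering described above. It assumes that mylist has exactly one element per row of mydf and that both are in the same order. The names keep, mydf_clean, and clean_text are hypothetical and used only for illustration. It relies on base R's lengths(), which counts the elements of each list entry, so a multi-token entry such as row 6 is kept regardless of how it was segmented.

```r
# Sketch only: assumes mylist[[i]] is the cleaned text for row i of mydf.
mydf <- data.frame(
  document = c("I want an apple", "//:", "This is a dog", "Suppose that...", "@%!!", "半夜快笑死"),
  id = c(1, 2, 3, 4, 5, 6),
  gender = c("M", "F", "M", "M", "F", "?"),
  source = c("Facebook", "Facebook", "Twitter", "Facebook", "Twitter", "Weibo")
)

mylist <- list(
  "i want an apple",
  character(0),
  "this is a dog",
  "suppose that",
  character(0),
  c("半夜", "快", "笑死")
)

# Guard against a length mismatch before using positions as the link.
stopifnot(length(mylist) == nrow(mydf))

# lengths() returns the number of elements in each list entry;
# character(0) has length 0, while a segmented sentence has length >= 1.
keep <- lengths(mylist) > 0

# Drop the rows whose cleaned text is empty.
mydf_clean <- mydf[keep, ]

# Optionally attach the cleaned text, joining multi-token entries with a space
# (clean_text is a hypothetical column name).
mydf_clean$clean_text <- vapply(mylist[keep], paste, character(1), collapse = " ")

mydf_clean
```

Testing the length of each list element, rather than the number of characters in a single string, sidesteps the question of where a segmented sentence ends.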