I understand that `cSplit_e` in "splitstackshape" can be used to convert multiple values under one column into separate columns with binary values. I am working on a text problem where I need to calculate tf-idf, and the values within a column are not necessarily unique, e.g.:
docname ftype doc_text
1 mw hello, hi, how, are, you, hello
2 gw hi,yo,man
3 mw woha,yo, yoman
dput(df)
structure(list(docname = 1:3, ftype = c("mw", "gw", "mw"), doc_text = structure(1:3, .Label = c("hello, hi, how, are, you, hello",
"hi,yo,man", "woha,yo, yoman"), class = "factor")), .Names = c("docname",
"ftype", "doc_text"), class = "data.frame", row.names = c(NA,
-3L))
In the example above, for doc 1, `cSplit_e` converts doc_text into 5 separate columns and gives "hello" a value of 1 even though it appears twice. Is there a way to modify this function so that it accounts for repeated values?
In essence, here is what I am trying to achieve: Given a data frame
docname ftype doc_text
1 mw hello, hi, how, are, you, hello
2 gw hi,yo,man
3 mw woha,yo, yoman
I want to convert doc_text into multiple columns, one per value separated by ",", and get the frequency of each value in each row. So the result should be
docname ftype are hello hi how man woha yo yoman you
1 mw 1 2 1 1 0 0 0 0 1
2 gw 0 0 1 0 1 0 1 0 0
3 mw 0 0 0 0 0 1 1 1 0
I would appreciate it if someone could show how to accomplish this using "splitstackshape" or in a different way. The eventual aim is to calculate tf-idf.
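For reference, here is a minimal base R sketch of the kind of counting described above, without "splitstackshape". It is only an illustration and assumes that terms are separated by commas with optional surrounding whitespace. The names `tokens`, `vocab`, `counts`, and `result` are chosen here for illustration and are not part of any package.

```r
# Example data, rebuilt from the dput() output in the question
df <- data.frame(
  docname = 1:3,
  ftype = c("mw", "gw", "mw"),
  doc_text = c("hello, hi, how, are, you, hello", "hi,yo,man", "woha,yo, yoman"),
  stringsAsFactors = FALSE
)

# Split each document on commas, then trim whitespace around each term
tokens <- lapply(strsplit(as.character(df$doc_text), ",", fixed = TRUE), trimws)

# Shared vocabulary across all documents, sorted for a stable column order
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document; tabulating a factor with fixed
# levels keeps zero counts for terms that a document does not contain
counts <- t(vapply(
  tokens,
  function(x) as.integer(table(factor(x, levels = vocab))),
  integer(length(vocab))
))
colnames(counts) <- vocab

# Attach the counts to the identifying columns
result <- cbind(df[c("docname", "ftype")], counts)
print(result)
```

For the example data, this should give "hello" a count of 2 in the first row, with the remaining counts matching the desired table above. The term-frequency matrix `counts` could then serve as the input for a tf-idf weighting.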
Thanks.