
I understand that "cSplit_e" in "splitstackshape" can be used to convert multiple values under one column into separate columns with binary values. I am dealing with a text problem, calculating tf-idf, and the values under a column are not necessarily unique. e.g.,

docname   ftype                        doc_text
    1      mw               hello, hi, how, are, you, hello
    2      gw                       hi,yo,man
    3      mw                     woha,yo, yoman

dput(df)

   structure(list(docname = 1:3, ftype = c("mw", "gw", "mw"), doc_text = structure(1:3, .Label = c("hello, hi, how, are, you, hello", 
"hi,yo,man", "woha,yo, yoman"), class = "factor")), .Names = c("docname", 
"ftype", "doc_text"), class = "data.frame", row.names = c(NA, 
-3L))

For the above example, if we consider doc-1, cSplit_e will convert doc_text into 5 separate columns, each having a value of "1", even though "hello" appears twice. Is there a way to modify this function to account for repeated values?

In essence, here is what I am trying to achieve: Given a data frame

docname   ftype                        doc_text
    1      mw               hello, hi, how, are, you, hello
    2      gw                       hi,yo,man
    3      mw                     woha,yo, yoman

I want to convert doc_text into multiple columns based on the column values separated by "," and get their respective frequencies. So the result should be

docname ftype are hello hi how man woha yo yoman you
     1   mw    1     2  1   1   0    0  0     0   1
     2   gw    0     0  1   0   1    0  1     0   0
     3   mw    0     0  0   0   0    1  1     1   0

I would appreciate it if someone could show how to accomplish this using "splitstackshape" or in a different way. The eventual aim is to calculate tf-idf.
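For reference, the counting step can also be done in plain base R, with no extra packages. This is a minimal sketch, assuming tokens are separated by commas with optional whitespace (as in the data above):

```r
df <- structure(list(docname = 1:3, ftype = c("mw", "gw", "mw"),
                     doc_text = c("hello, hi, how, are, you, hello",
                                  "hi,yo,man", "woha,yo, yoman")),
                class = "data.frame", row.names = c(NA, -3L))

# split each document on commas (with optional surrounding spaces)
tokens <- strsplit(as.character(df$doc_text), ",\\s*")

# long form: one row per (document, term) occurrence
long <- data.frame(docname = rep(df$docname, lengths(tokens)),
                   term = unlist(tokens))

# contingency table: rows = documents, columns = terms, cells = counts
counts <- as.data.frame.matrix(table(long$docname, long$term))
res <- cbind(df[c("docname", "ftype")], counts)
```

The `table()` call counts repeated terms automatically, so "hello" in doc-1 comes out as 2 rather than a binary 1.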

Thanks.

syebill
  • I'll add an answer when [V2 of "splitstackshape"](https://github.com/mrdwab/splitstackshape/tree/v2.0) is released as `cSplit_e()` has been modified to have a "count" mode now that will do what you expect. – A5C1D2H2I1M1N2O1R2T1 Mar 31 '18 at 11:11

2 Answers


We can do this with mtabulate after splitting the 'doc_text' column:

library(qdapTools)
cbind(df[1], mtabulate(strsplit(as.character(df$doc_text), ",\\s*")))
#   docname are hello hi how man woha yo yoman you
#1       1   1     2  1   1   0    0  0     0   1
#2       2   0     0  1   0   1    0  1     0   0
#3       3   0     0  0   0   0    1  1     1   0
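Since the eventual aim is tf-idf, the count matrix above can be weighted directly. A base-R sketch, assuming the common `log(N / df)` idf definition (note that tm's `weightTfIdf` uses a normalized, log2-based variant, so its numbers will differ):

```r
df <- data.frame(doc_text = c("hello, hi, how, are, you, hello",
                              "hi,yo,man", "woha,yo, yoman"))

# term-frequency matrix: rows = documents, columns = terms
tokens <- strsplit(as.character(df$doc_text), ",\\s*")
tf <- table(rep(seq_along(tokens), lengths(tokens)), unlist(tokens))

# idf = log(N / number of documents containing the term)
idf <- log(nrow(tf) / colSums(tf > 0))

# tf-idf: scale each term column by its idf
tfidf <- sweep(tf, 2, idf, `*`)
```

Terms that occur in every document get an idf of `log(1) = 0`, so they drop out of the scores, which is the usual tf-idf behaviour.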

Or another option is the tidyverse:

library(tidyverse)
separate_rows(df, doc_text) %>% #split to long format
           group_by(docname, doc_text) %>% #group by variables
           tally() %>% #get the frequency
           spread(doc_text, n, fill=0) #reshape to wide

Or as @Frank suggested

library(splitstackshape)
cSplit(df, "doc_text", ",", "long")[, dcast(.SD, docname ~ doc_text)]
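A slightly more explicit variant of that dcast call, a sketch that also keeps the ftype column and spells out `length` as the aggregate (rather than relying on dcast's default guess for duplicates):

```r
library(splitstackshape)  # attaches data.table as a dependency
library(data.table)

df <- data.frame(docname = 1:3, ftype = c("mw", "gw", "mw"),
                 doc_text = c("hello, hi, how, are, you, hello",
                              "hi,yo,man", "woha,yo, yoman"))

# split to long form (cSplit strips surrounding whitespace by default),
# then count occurrences per document and term
long <- cSplit(df, "doc_text", ",", "long")
res <- dcast(long, docname + ftype ~ doc_text,
             value.var = "doc_text", fun.aggregate = length)
```

Combinations that never occur are filled with 0, so the result matches the wide count table asked for in the question.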
akrun

With a little text-mining:

docs <- gsub('[[:punct:]]+', ' ', as.character(df$doc_text))
library(tm)
corpus <- Corpus(VectorSource(docs))

# compute Term Frequencies
as.matrix(DocumentTermMatrix(corpus, control = list(wordLengths=c(2,Inf))))
#     Terms
#Docs are hello hi how man woha yo yoman you
#   1   1     2  1   1   0    0  0     0   1
#   2   0     0  1   0   1    0  1     0   0
#   3   0     0  0   0   0    1  1     1   0

# compute Tf-Idf scores
as.matrix(DocumentTermMatrix(corpus, control = list(wordLengths=c(2,Inf), weighting=weightTfIdf)))
#         Terms
#Docs       are     hello         hi       how       man      woha        yo     yoman       you
#   1 0.2641604 0.5283208 0.09749375 0.2641604 0.0000000 0.0000000 0.0000000 0.0000000 0.2641604
#   2 0.0000000 0.0000000 0.19498750 0.0000000 0.5283208 0.0000000 0.1949875 0.0000000 0.0000000
#   3 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.5283208 0.1949875 0.5283208 0.0000000
Sandipan Dey
  • Judging by your first line, this would treat multiword values like "hello world" as separate values. If so, you might want to mention that caveat. – Frank Feb 23 '17 at 18:06
  • @Frank yes, it's a `bag of words` representation, so it's not considering `n-grams` for `n>1`, e.g., phrases. – Sandipan Dey Feb 23 '17 at 18:09
  • @Sandipan I can see that you have taken out the punctuation to remove the ",", but what if the text has meaningful punctuation that should form part of the words? Should I omit the first step and follow the rest? Also, can you shed light on `wordLengths = c(2, Inf)`? Is it used to specify a minimum and maximum word length in the documents? – syebill Feb 24 '17 at 20:56
  • `2, Inf` are the min and max word lengths (by default, `DocumentTermMatrix` ignores all words shorter than 3 characters). Also, it's standard practice to remove punctuation and strip whitespace before computing the `DocumentTermMatrix`. If you skip the first line, you may get tokens with extra spaces, and if you delete punctuation outright (instead of replacing it with a space, as the `gsub` call does), you may get words concatenated together, which may be undesirable. – Sandipan Dey Feb 24 '17 at 21:28