I have a very large corpus/DFM/DTM object that I want to compute the linguistic similarity for. However, the object is too large, so every time I try to compute the cosine similarity statistic, R shuts down. This is what I'm using to calculate the cosine similarity scores:

test_cosine <- textstat_simil(myTFIDF, margin = "features", method = "cosine") # cosine similarity between all pairs of terms

But I'm really only interested in the scores for a few terms. For example, I want to look at the cosine similarity of every other term to "conservative" and to "liberal". Is there a way to tell R to produce the similarity scores against just those two terms? I saw a post on here that recommended another way of computing the cosine similarity between two terms (`d <- stringdist("conserv", "liberal", method = "cosine")`). This did produce a score, but it confused me: it never asked me to specify the data, so I don't know how it computes the score.

If not, is there another way to get the rhetorical similarity scores for terms from a large corpus/DTM/DFM object?

EDIT*** Here is the dput for a small subset of the data.

structure(list(`twitterdata[1:50, ]` = c("matter time", "beatl", 
"craze left wing hippi weirdo freak rant observ", " officialmonstax", 
"bienvenido club fan dedicado informar apoyar hermosa talentosa estrella mexicana angelicaval sigueno", 
"boa lagoa", "forget thing hurt lesson learn make mistak can never never regret thing made smile", 
"vynox come soon", "offici th parkad downtown sandiego locat next petco park histor gaslamp quarter", 
"offici salvat armi chicago metropolitan divis largest direct provid social servic state illinoi", 
"jackson ude journalist skill practition field polit communic media public polici polit manag", 
"encourag motiv inspir lead execut", " negat world", "seattl bremerton elliot thorsen ethorsen", 
"laxxxxx", "west geauga high school hockey", "baltimor born rais dundalk", 
"eight thirti thirteen", "talk show produc newstalk humber colleg radio broadcast graduat dolphin magic blue jay aggi mapl leaf hotspur fan", 
"sdsu ", "read book host literari lair fix can also found gomer product now", 
"va dancer yrs old basketbal player volleybal setter yes may loser best damn loser ever meet", 
"can star magic workshopp", "offici page strongsvill ladi mustang varsiti soccer team", 
"queen hous wife mom sis aunt garden chef caregiv teacher counselor pro life liberti happi godfear christfollow biblebeliev john ", 
"sport star war hous music", "alexandria atletico jenni tcw ", 
"fun guitar", "jesus dont give fail everyth mean noth realiz good bro aredhel nargothrond elf name", 
"life give lemon return ask zayn malik pleas ", "collector thing beauti past present find rebelmous vintagedressparlour", 
"streamer aspir musician fit health", "artist illustr design d model busi commiss open charact simpl background", 
"musico poeta loco artista naturaleza", "totalment fascinado afeccion emocional biologica cuerpo humano eterno enamorado letra lavida dio guia hoy manana siempr", 
"girl mani dream shawnmend girlfriend dream", "alway look delici bigup friend", 
"sassi sexi wild lover music writer blogger product junki pierc tattoo enthusiast hopeless romant makeup artist", 
"stop useless start pizza", "keep upto date fixtur result latest news updat across gfa leagu", 
"ez lab onlin portal various nabl iso certifi diagnost lab avail provid qualiti assur healthcar consum", 
" that flick tho", " pinch dinosaurio risueno guey music drug physiotherapi campus puebla", 
"look sharp cut edg design get", "proud eph gopher track alum bs kin umn mba sp mgt cuc ski racer climber around athlet ao", 
"professor nerdi abound warn fond book turn brain", "gotta risk get biscuit mdp presleyy aspir sing avocado ladi", 
"dragonapothek ist onlin apothek allgemein dieser bieten manner sexuell gesundheit medizin kamagra", 
"help compani individu discov fit clariti therapist connect agent outgo introvert flaw believ husband dad bbq er", 
"keep negat aliv babi termin hate spread posit cudfam cudlif"
)), row.names = c(NA, -50L), class = "data.frame")
lwe
  • Please read about [how to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and update your question accordingly. Include a sample of your data by pasting the output of `dput()` into your post or `dput(head())` if you have a large data frame. Also include code you have tried, any relevant errors, and expected output. If you cannot post your data, then post code for creating representative data. [Do not post images of code and/or data.](https://meta.stackoverflow.com/a/285557/6382434). – LMc Aug 25 '23 at 20:25
  • Cosine similarity and other string distance algorithms are used to calculate the similarity of two strings, **not** of their concepts. If you calculated the cosine similarity of "conservative" versus "liberal" you might find they are not similar. This is because they do not share similar characters or orders of characters. – LMc Aug 25 '23 at 21:07
  • For example, `stringdist::stringsim("identified", "unidentified", method = "cosine")` computes a high similarity score (0.959), despite these two words having opposite meanings. This is because they share many of the same characters in similar positions. – LMc Aug 25 '23 at 21:08
  • @LMc, I understand now about string similarity, but cosine similarity is different from string similarity and, at least traditionally, is not measured based on the characters or the position of the characters, but rather on the relationship of terms, based on their co-occurrence across documents in a corpus. https://www.sciencedirect.com/topics/computer-science/cosine-similarity#:~:text=2.4.&text=Cosine%20similarity%20measures%20the%20similarity,document%20similarity%20in%20text%20analysis. – lwe Aug 25 '23 at 21:09
  • I appreciate your clarification on the stringdist and stringsim command, though. That's helpful. – lwe Aug 25 '23 at 21:10
  • Yes, I was using a simple example, but the "co-occurrence across documents" you mention is the same idea. When comparing two words such as "liberal" and "conservative", the similarity is computed from the co-occurrence of characters. You may extend this concept to an entire document to produce a similarity score. – LMc Aug 25 '23 at 21:12
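
To illustrate the point made in the comments about why `stringdist()` never asks for data: as far as I understand it, with `method = "cosine"` and the default `q = 1`, each string is turned into a vector of character counts and the cosine is taken between those two vectors. The base-R sketch below reimplements that idea under this assumption; `char_cosine` is a hypothetical helper, not part of any package.

```r
# Cosine similarity between the character-count vectors of two strings.
# Assumption: this mirrors the q = 1 q-gram profile that stringdist uses.
char_cosine <- function(a, b) {
  xa <- strsplit(a, "")[[1]]            # characters of the first string
  xb <- strsplit(b, "")[[1]]            # characters of the second string
  chars <- union(xa, xb)                # shared character vocabulary
  va <- as.numeric(table(factor(xa, levels = chars)))  # counts for a
  vb <- as.numeric(table(factor(xb, levels = chars)))  # counts for b
  sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

round(char_cosine("identified", "unidentified"), 3)  # 0.959
```

Only the two strings enter the calculation, so the score says nothing about how the terms are used across the documents of a corpus.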

1 Answer

Pass a DFM with selected columns to y:

textstat_simil(x = dfmt, y = dfmt[, c("conservative", "liberal")], margin = "features", method = "cosine")
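
A minimal, self-contained sketch of this approach on toy data, assuming the quanteda and quanteda.textstats packages are installed. The texts, the object names (`txt`, `targets`, `sim`, `m`) and the stemmed feature names "conserv" and "liber" are made up for illustration; substitute your own DFM and whatever `featnames()` reports for your terms.

```r
library(quanteda)
library(quanteda.textstats)

# Toy, already-stemmed documents (hypothetical data, not from the question)
txt <- c(d1 = "conserv tax cut liber",
         d2 = "liber health care",
         d3 = "conserv tax gun",
         d4 = "music guitar fun")

dfmt <- dfm_tfidf(dfm(tokens(txt)))   # tf-idf weighted DFM

# Keep only target terms that actually exist as features
targets <- intersect(c("conserv", "liber"), featnames(dfmt))

# Similarity of every feature to the targets only:
# the result has one row per feature and one column per target
sim <- textstat_simil(dfmt, y = dfmt[, targets],
                      margin = "features", method = "cosine")

m <- as.matrix(sim)
head(sort(m[, "conserv"], decreasing = TRUE))  # terms closest to "conserv"
```

Because `y` has only two columns, the result is an n-features by 2 matrix instead of the full n by n matrix, which is what should keep memory use manageable on a large DFM.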
Kohei Watanabe