I am using R.

I have two sets of data (securityj and securityc) and want to compute the cosine similarity between them.

I used the following code, which relies on the lsa library:

databasfile = tempfile()
dir.create(databasfile)
write( databasej, file=paste(databasfile, "D1", sep="/"))
write( databasec, file=paste(databasfile, "D2", sep="/"))
myMatrix = textmatrix(databasfile)

databaseRes <- lsa::cosine(myMatrix[,1], myMatrix[,2])

securityfile = tempfile()
dir.create(securityfile)

write( securityj, file=paste(securityfile, "D1", sep="/"))
write( securityc, file=paste(securityfile, "D2", sep="/"))
securityMatrix = textmatrix(securityfile)

securityRes <- lsa::cosine(securityMatrix[,1], securityMatrix[,2])

I get this error when running textmatrix(securityfile):

Error in FUN(X[[i]], ...) : [lsa] - could not open file C:\Users\AAA\AppData\Local\Temp\RtmpIDmcl7\file1898438fde2/D1 due to encoding problems of the file.

With databasfile everything works, but with securityfile I get the error, even though both data sets are taken from the same original file. I create each file and then read it immediately. I tried changing the original file's encoding and made sure it is UTF-8, but nothing changed.
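One way to narrow this down before any file is written is to check the character vector itself for entries that are not valid UTF-8. The following is a minimal sketch using base R only; the vector `bigrams` is a hypothetical stand-in for securityj, and the assumption that the offending bytes are latin1 is only an illustration, not something confirmed by the question.

```r
# Hypothetical stand-in for securityj: the third entry carries a raw byte
# (0xE9, "e" with acute accent in latin1) that is not valid UTF-8 on its own.
bigrams <- c("risk assessment", "best practices", "caf\xe9 culture")

# validUTF8() (base R >= 3.3.0) returns FALSE for entries whose bytes
# do not form valid UTF-8.
bad <- !validUTF8(bigrams)
which(bad)

# Encoding() shows the declared encoding mark of each entry.
Encoding(bigrams)

# Assumption: the invalid entries are latin1. Re-encode only those to UTF-8.
fixed <- bigrams
fixed[bad] <- iconv(bigrams[bad], from = "latin1", to = "UTF-8")
all(validUTF8(fixed))
```

If `which(bad)` returns positions for securityj but nothing for databasej, that would be consistent with only one of the two pairs failing.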

textmatrix is a function in the lsa library. My data is two lists of bigrams taken from cleaned job ads. Both (databasej, databasec) and (securityj, securityc) came from the same text file; it worked for the first pair, but I get the error for the second. As for the separator, sep="/" is the same as in the function's documentation.

Sample input from securityj:

 [333] "risk assessment"               "beginning darkmatter"         
 [335] "best practices"                "create dream"                 
 [337] "darkmatter agile"              "darkmatter bring"             
 [339] "darkmatter impossible"         "darkmatter place"             
 [341] "drive lead"                    "education drive"              
 [343] "experience education"          "forensic analysis"            
 [345] "freedom create"                "knowledge network"            
 [347] "lead missing"                  "missing freedom"              
 [349] "offers personal"               "perl python"                  
 [351] "related security"              "security risks"               
 [353] "standard operating"            "windows linux"                
 [355] "security controls"             "systems security"             
 [357] "advice guidance"               "application penetration"      
 [359] "certified information"         "forensics malware"            
 [361] "guidance areas"                "networks applications"        
 [363] "new era"                       "practice advice"              
 [365] "provisioning best"             "security certified"           
 [367] "web application"               "government oil"               
 [369] "kill chain"                    "network based"                
 [371] "risk assessments"              "technical experience"         
 [373] "audit compliance"              "business units"               
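
For bigram vectors like the sample above, the cosine similarity can also be computed directly in base R without writing temporary files. This is a minimal sketch, not the lsa approach: `doc_j` and `doc_c` are hypothetical stand-ins for securityj and securityc, `cosine_sim` is an illustrative helper, and each bigram is counted as a single term, which may differ from how textmatrix tokenizes the files.

```r
# Hypothetical stand-ins for securityj and securityc.
doc_j <- c("risk assessment", "best practices", "security controls", "risk assessment")
doc_c <- c("best practices", "web application", "risk assessment")

# Shared vocabulary: every distinct bigram from either vector.
vocab <- sort(union(doc_j, doc_c))

# Term-frequency vectors aligned to the shared vocabulary.
tf_j <- as.numeric(table(factor(doc_j, levels = vocab)))
tf_c <- as.numeric(table(factor(doc_c, levels = vocab)))

# Cosine similarity: dot product divided by the product of the vector norms.
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cosine_sim(tf_j, tf_c)
```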
  • Where does the function `textmatrix` come from? When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Mar 05 '18 at 21:11
  • I edited the question. – Ahmed Mohammed Mar 06 '18 at 03:41

2 Answers

It's very difficult to evaluate this question without a reproducible example, including the source for what looks like a user-defined function, textmatrix.

The only thing that jumps out at me is that the files you're creating are very odd. You're creating a valid but random directory, and then it looks like you're trying to place two files in that directory with the wrong separator: the file separator in your path is a backslash, and you're adding files to the directory using a forward slash. Depending on what textmatrix is (what it does with the character vector argument you pass to it) and what the structure of databasej and databasec is, it may be able to make sense of the files in the database case but not in the security case. But this is guesswork without a reproducible example.

You could try a platform-independent file separator with the built-in variable .Platform$file.sep or, if you're just running this locally, match it to the separator in your path, which is \ rather than /. If that works, then hooray. If not, try writing a reproducible example and you may get better help.
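
A minimal sketch of building the paths without hard-coding any separator, using base R's `file.path()`, which joins components with `.Platform$file.sep` by default. The names `out_dir` and `d1` and the sample bigrams are hypothetical; this only shows how the path is constructed and is not confirmed to change the error in the question.

```r
# Create a fresh temporary directory, as in the question.
out_dir <- tempfile()
dir.create(out_dir)

# file.path() joins the components using the platform's file separator.
d1 <- file.path(out_dir, "D1")

# Write one bigram per line (hypothetical sample data).
writeLines(c("risk assessment", "best practices"), d1)

# Confirm the file landed inside the directory.
file.exists(d1)
list.files(out_dir)
```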

De Novo
  • I said that I am using the lsa library to compute the cosine similarity; textmatrix is a function in this library. – Ahmed Mohammed Mar 06 '18 at 03:52
  • Could you please include a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – De Novo Mar 06 '18 at 03:55
  • So it appears textmatrix expects a directory with files in it. I can't tell you why it's able to parse the database files, since I don't know what they are, but I can tell you it's unsurprising that it can't parse the security files since they don't use the correct file separator. Did you try it with the correct file separator for your platform? – De Novo Mar 06 '18 at 03:59
  • I said that I am using the lsa library to compute the cosine similarity; `textmatrix` is a function in this library. My data is two lists of bigrams taken from cleaned job ads. The weird thing is that both (databasej, databasec) and (securityj, securityc) came from the same text file; it worked for the first pair, but I get the error for the second. As for the separator, it's the same as in the function's documentation. What other details are needed? – Ahmed Mohammed Mar 06 '18 at 03:59

I changed the file encoding to ANSI, and it worked.
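
The fix above changes the source file on disk. An alternative is to leave the file alone and declare its encoding when reading it, so that R converts the text on the way in. The sketch below uses base R only and assumes the source is in a single-byte encoding such as latin1 ("ANSI" on Windows usually refers to a code page of that kind, but that is an assumption here); the file and its contents are made up for illustration.

```r
# Build a small example file containing a latin1 byte (0xE9).
path <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("caf"), as.raw(0xe9), charToRaw(" culture\n")), path)

# Read without declaring an encoding: the raw byte comes through unchanged.
naive <- readLines(path)
validUTF8(naive)

# Read through a connection that declares the file's encoding, so the text
# is converted to the session's native encoding as it is read.
con <- file(path, open = "r", encoding = "latin1")
converted <- readLines(con)
close(con)

# After conversion the string can be represented as valid UTF-8.
validUTF8(enc2utf8(converted))
```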