I am trying to run wordscores on a corpus with quanteda, but with the "mv" rescaling the prediction is not anchored on the reference texts I selected. I want the value -1 assigned to "1999_St_CON" and the value 1 assigned to "1999_St_SNP". This works for the former but not for the latter: the 1 is assigned instead to "1999_St_FAW", the second document in the subsetted corpus. In addition, although I set -1 and 1 as the reference values, the rescaled scores go beyond that range ("1999_St_SNP" ends up at 3.51). The "lbg" rescaling works fine. Thanks.

Here is the code:

# Load packages
library(quanteda)
library(readtext)
library(stringr)
library(dplyr)
library(tidyr)
library(rowr)

###Load all general debates
DPG <- readtext("~Parliamentary session/CP/*.txt", encoding="utf-8")    
DPG

DPGcorp <- corpus(DPG)    
docnames(DPGcorp) <- DPG$doc_id  # use the doc_id column created by readtext as the document names
summary(DPGcorp)

### Create new docvars (Year and Party) from the document names
docvars(DPGcorp, "Year") <- substring(docnames(DPGcorp), 1, 4)
docvars(DPGcorp, "Party") <- substring(docnames(DPGcorp), 9, 11)
summary(DPGcorp)

#wordscores    
corpus1999 <- corpus_subset(DPGcorp, Year==1999)#select year 1999
summary(corpus1999)    
dfm1999 <- dfm(corpus1999, stem = TRUE, remove = stopwords("english"), remove_punct = TRUE)    
head(dfm1999)

#Reference scores    
refscores <- rep(NA,nrow(dfm1999))#repeat NA for the number of rows of the dfm    
refscores[str_detect(rownames(dfm1999), "1999_St_CON")] <- -1    
refscores[str_detect(rownames(dfm1999), "1999_St_SNP")] <- 1


#Wordscore model    
ws1999 <- textmodel_wordscores(dfm1999, refscores, scale="linear", smooth=1)    
ws1999

wordscore1999 <- predict(ws1999, rescaling="mv")    
wordscore1999


#Writing the results into data frame    
ws.1999 <- data.frame(cbind(docvars(corpus1999),
                            wordscore1999))    
ws.1999    
ws.1999 <- dplyr::rename(ws.1999, wscore = wordscore1999)    
ws.1999

Here is the output:

 > corpus1999 <- corpus_subset(DPGcorp, Year==1999)
 > summary(corpus1999)
 Corpus consisting of 7 documents:

            Text Types Tokens Sentences          doc_id Year Party
 1999_St_CON.txt   390    948        32 1999_St_CON.txt 1999   CON
 1999_St_FAW.txt   181    394        16 1999_St_FAW.txt 1999   FAW
 1999_St_GOV.txt   560   2126        84 1999_St_GOV.txt 1999   GOV
 1999_St_LAB.txt   289    747        36 1999_St_LAB.txt 1999   LAB
 1999_St_LIB.txt   258    640        26 1999_St_LIB.txt 1999   LIB
 1999_St_SNP.txt   393   1201        41 1999_St_SNP.txt 1999   SNP
 1999_St_SSP.txt   278    632        25 1999_St_SSP.txt 1999   SSP


 > 
 > dfm1999 <- dfm(corpus1999, stem = TRUE, remove = stopwords("english"), remove_punct = TRUE)
 > head(dfm1999)
 Document-feature matrix of: 6 documents, 939 features (75.6% sparse).
 > 
 > #Reference scores
 > refscores <- rep(NA,nrow(dfm1999))#repeat NA for the number of rows of the dfm
 > 
 > refscores[str_detect(rownames(dfm1999), "1999_St_CON")] <- -1
 > refscores[str_detect(rownames(dfm1999), "1999_St_SNP")] <- 1
 > 
 > #Wordscore model
 > ws1999 <- textmodel_wordscores(dfm1999, refscores, scale="linear", smooth=1)
 > ws1999

 Call:
 textmodel_wordscores.dfm(x = dfm1999, y = refscores, scale = "linear", 
 smooth = 1)

 Scale: linear; 2 reference scores; 939 scored features.
 > wordscore1999 <- predict(ws1999, rescaling="mv")
 > wordscore1999
 1999_St_CON.txt 1999_St_FAW.txt 1999_St_GOV.txt 1999_St_LAB.txt 
      -1.0000000       1.0000000       0.7614462       1.3593657 
 1999_St_LIB.txt 1999_St_SNP.txt 1999_St_SSP.txt 
       1.0536728       3.5124870       0.9350710 
> 
> #Writing the results into data frame
> ws.1999 <- data.frame(cbind(docvars(corpus1999),
+                             wordscore1999))
> ws.1999
                         doc_id Year Party wordscore1999
1999_St_CON.txt 1999_St_CON.txt 1999   CON    -1.0000000
1999_St_FAW.txt 1999_St_FAW.txt 1999   FAW     1.0000000
1999_St_GOV.txt 1999_St_GOV.txt 1999   GOV     0.7614462
1999_St_LAB.txt 1999_St_LAB.txt 1999   LAB     1.3593657
1999_St_LIB.txt 1999_St_LIB.txt 1999   LIB     1.0536728
1999_St_SNP.txt 1999_St_SNP.txt 1999   SNP     3.5124870
1999_St_SSP.txt 1999_St_SSP.txt 1999   SSP     0.9350710
> 
> ws.1999 <- dplyr::rename(ws.1999, wscore = wordscore1999)
> ws.1999
                         doc_id Year Party     wscore
1999_St_CON.txt 1999_St_CON.txt 1999   CON -1.0000000
1999_St_FAW.txt 1999_St_FAW.txt 1999   FAW  1.0000000
1999_St_GOV.txt 1999_St_GOV.txt 1999   GOV  0.7614462
1999_St_LAB.txt 1999_St_LAB.txt 1999   LAB  1.3593657
1999_St_LIB.txt 1999_St_LIB.txt 1999   LIB  1.0536728
1999_St_SNP.txt 1999_St_SNP.txt 1999   SNP  3.5124870
1999_St_SSP.txt 1999_St_SSP.txt 1999   SSP  0.9350710
> 
Ion
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. What exactly is failing here? Is there an error message or something? – MrFlick Feb 27 '18 at 15:50
  • Thank you, I just added the output. There are no error messages, I just do not understand why the 1 value previously set is allocated to the text of the party FAW and not to the party SNP. – Ion Feb 27 '18 at 16:07
  • You found a bug (thanks!), for which I have opened [an issue](https://github.com/quanteda/quanteda/issues/1251) and am solving it now. Please use https://github.com/quanteda/quanteda/issues for apparent bugs and SO for how-to questions. (But I appreciate that the difference is not always clear.) – Ken Benoit Feb 27 '18 at 16:24
  • I'm voting to close this question as off-topic because it's a bug report not a how-to question. As I indicated in the comments, I am solving the issue on GitHub. – Ken Benoit Feb 27 '18 at 16:25
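
As the comments above note, this is a bug in the "mv" rescaling rather than a usage error. Until a fixed version is installed, one possible workaround is to redo the linear rescaling by hand, so that the two reference texts get their reference values back. The sketch below uses only base R. `rescale_to_refs` is a hypothetical helper, not a quanteda function, and it assumes the scores printed in the question are still a linear transformation of the raw scores (the exact -1 and 1 on the first two documents suggest they were simply anchored on the wrong pair of texts). In a live session the unrescaled predictions, for example from `predict(ws1999, rescaling = "none")` if your version supports that option, would be the more natural input.

```r
# Scores printed in the question for rescaling = "mv" (copied from the output).
mv_printed <- c(
  "1999_St_CON.txt" = -1.0000000,
  "1999_St_FAW.txt" =  1.0000000,
  "1999_St_GOV.txt" =  0.7614462,
  "1999_St_LAB.txt" =  1.3593657,
  "1999_St_LIB.txt" =  1.0536728,
  "1999_St_SNP.txt" =  3.5124870,
  "1999_St_SSP.txt" =  0.9350710
)

# Reference scores in document order; NA marks a virgin (unscored) text.
refscores <- c(-1, NA, NA, NA, NA, 1, NA)

# Hypothetical helper (NOT part of quanteda): linearly rescale `scores` so that
# the text holding the lowest reference score and the text holding the highest
# reference score are mapped back onto exactly those reference values.
rescale_to_refs <- function(scores, ref) {
  stopifnot(length(scores) == length(ref), sum(!is.na(ref)) >= 2)
  lower <- which(ref == min(ref, na.rm = TRUE))[1]  # position of the low anchor
  upper <- which(ref == max(ref, na.rm = TRUE))[1]  # position of the high anchor
  s_lower <- unname(scores[lower])
  s_upper <- unname(scores[upper])
  stopifnot(s_upper != s_lower)
  slope <- (ref[upper] - ref[lower]) / (s_upper - s_lower)
  (scores - s_lower) * slope + ref[lower]
}

fixed <- rescale_to_refs(mv_printed, refscores)
print(round(fixed, 4))
```

With this re-anchoring, "1999_St_CON" maps to -1 and "1999_St_SNP" maps to 1. Every other text falls between them, because their printed scores lie between those of the two reference texts.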
