
Edited to shorten and provide sample data.

I have text data consisting of 8 questions asked of a number of participants at two points in time. I want to use text2vec to compare the similarity of their responses to these questions across the two time points (duplicate detection). Here is how my initial data is structured (in this example there are just 3 participants, 4 questions instead of 8, and 2 quarters/time periods). I want to do a similarity comparison of each participant's response in the first quarter vs. the second quarter. I intend to use text2vec's psim2 function to do this.

df<-read.table(text="ID,Quarter,Question,Answertext
               Joy,1,And another question,adsfjasljsdaf jkldfjkl
               Joy,2,And another question,dsadsj jlijsad jkldf 
               Paul,1,And another question,adsfj aslj sd afs dfj ksdf
               Paul,2,And another question,dsadsj jlijsad
               Greg,1,And another question,adsfjasljsdaf
               Greg,2,And another question, asddsf asdfasd sdfasfsdf
               Joy,1,this is the first question that was asked,this is joys answer to this question
               Joy,2,this is the first question that was asked,this is joys answer to this question
               Paul,1,this is the first question that was asked,this is Pauls answer to this question
               Paul,2,this is the first question that was asked,Pauls answer is different 
               Greg,1,this is the first question that was asked,this is Gregs answer to this question nearly the same
               Greg,2,this is the first question that was asked,this is Gregs answer to this question
               Joy,1,This is the text of another question,more random text
               Joy,2,This is the text of another question, adkjjlj;ds sdafd
               Paul,1,This is the text of another question,more random text
               Paul,2,This is the text of another question, adkjjlj;ds sdafd
               Greg,1,This is the text of another question,more random text
               Greg,2,This is the text of another question,sdaf asdfasd asdff
               Joy,1,this was asked second.,some random text
               Joy,2,this was asked second.,some random text that doesn't quite match joy's response the first time around
               Paul,1,this was asked second.,some random text
               Paul,2,this was asked second.,some random text that doesn't quite match Paul's response the first time around
               Greg,1,this was asked second.,some random text
               Greg,2,this was asked second.,ada dasdffasdf asdf  asdfa fasd sdfadsfasd fsdas asdffasd
", header=TRUE,sep=',')

I've done some more thinking and I believe the right approach is to split the dataframe into a list of dataframes, not separate items.

questlist<-split(df,f=df$Question)

then write a function to create the vocabulary for each question.

library(text2vec)

vocabmkr <- function(x) { itoken(as.character(x$Answertext), ids = x$ID) %>% create_vocabulary() %>% prune_vocabulary(term_count_min = 2) %>% vocab_vectorizer() }

test<-lapply(questlist, vocabmkr)

But then I think I need to split the original dataframe into question-quarter combinations, apply the corresponding vocabulary from the other list to each, and I'm not sure how to go about that.

Ultimately, I want a similarity score telling me if the participants are duplicating some or all of their responses from the first and second quarters.

EDIT: Here is how I would do this for a single question starting with the above dataframe.

library(dplyr) # filter() and full_join() below come from dplyr

quest1 <- filter(df, Question == "this is the first question that was asked")
quest1vocab <- itoken(as.character(quest1$Answertext), ids = quest1$ID) %>%
  create_vocabulary() %>%
  prune_vocabulary(term_count_min = 1) %>%
  vocab_vectorizer()

quest1q1<-filter(quest1,Quarter==1)
quest1q1<-itoken(as.character(quest1q1$Answertext),ids=quest1q1$ID) # tokenize question1 quarter 1

quest1q2<-filter(quest1,Quarter==2) 
quest1q2<-itoken(as.character(quest1q2$Answertext),ids=quest1q2$ID) # tokenize question1 quarter 2

#now apply the vocabulary to the two matrices
quest1q1<-create_dtm(quest1q1,quest1vocab)
quest1q2<-create_dtm(quest1q2,quest1vocab)

similarity<-psim2(quest1q1, quest1q2, method="jaccard", norm="none") #row by row similarity.

b<-data.frame(ID=names(similarity),Similarity=similarity,row.names=NULL) #make dataframe of similarity scores
endproduct<-full_join(b,quest1)

Edit: Ok, I have worked with the lapply some more.

df1<-split.data.frame(df,df$Question) #now we have 4 dataframes in the list, 1 for each question

vocabmkr<-function(x) {
  itoken(as.character(x$Answertext), ids=x$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()
}

vocab<-lapply(df1,vocabmkr) #this gets us another list and in it are the 4 vocabularies.

dfqq<-split.data.frame(df,list(df$Question,df$Quarter)) #and now we have 8 items in the list - each element is a question-quarter combination (4 questions over 2 quarters)

How do I apply the vocab list (consisting of 4 elements) to the dfqq list (consisting of 8)?
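One way to line the two lists up (a sketch, reusing df1 and vocab from above; score_question is a hypothetical helper, not part of the question) is to skip dfqq entirely: split by Quarter inside a helper and pair each question's dataframe with its own vectorizer via Map().

```r
library(text2vec)

# Hypothetical helper: build one DTM per quarter from the question's shared
# vectorizer, then compare row by row with psim2()
score_question <- function(d, vectorizer) {
  q1 <- d[d$Quarter == 1, ]
  q2 <- d[d$Quarter == 2, ]
  q1 <- q1[order(q1$ID), ]  # align rows by ID so psim2() compares the same
  q2 <- q2[order(q2$ID), ]  # participant's quarter-1 and quarter-2 answers
  m1 <- create_dtm(itoken(as.character(q1$Answertext), ids = q1$ID), vectorizer)
  m2 <- create_dtm(itoken(as.character(q2$Answertext), ids = q2$ID), vectorizer)
  psim2(m1, m2, method = "jaccard", norm = "none")
}

# Map() walks the two lists in parallel; indexing vocab by names(df1)
# guarantees each question's dataframe meets its own vectorizer
scores <- Map(score_question, df1, vocab[names(df1)])
```

scores is then a named list with one numeric vector of per-participant similarity scores per question.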

Will Hauser
    Welcome to SO! Instead, if you can provide input data frame example, and expected output, it would be easier for us to answer. Such long questions don't usually get quick answers, too much to read. Read here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – YOLO Jul 17 '18 at 19:27
  • I've edited the initial question and provided a sample dataframe to work with. Thanks. – Will Hauser Jul 17 '18 at 20:36
  • So you want to calculate the similarity of the answers between q1 and q2 for each user. right? – YOLO Jul 17 '18 at 20:44
  • Once again, `by` is ignored for `split` + `lapply`! – Parfait Jul 17 '18 at 20:45
  • You may need to show how you intend to compare similarity. Consider showing the long way with example from posted data. – Parfait Jul 17 '18 at 20:51
  • YOLO - in the sample data each user would have 4 scores since there are 4 questions. Each score indicates the similarity of that participant's response in q1 vs. q2. Parfait - I'm not sure what the comment about 'by' means. There is no `by` statement in there unless it's implicit in lapply. The key issue is that I need a vocabulary that is shared across quarters for each question. Perhaps it could be shared across all the questions, I don't know. Then I need to tokenize each question-quarter combination and apply the vocabulary to it. Then it's just a simple psim2 command on the matrix. – Will Hauser Jul 17 '18 at 21:00
  • If it helps, I followed this exactly after filtering to just 1 question. http://text2vec.org/similarity.html#jaccard_similarity the issue is that I want to do this to all 8 questions and this results in a ton of dataframes and redundant lines of code. There must be a way to do this using loops but I can't figure it out. – Will Hauser Jul 17 '18 at 21:05
  • I added the complete code to calculate the similarity scores for question 1. I'd like to be able to do all the questions in a loop (remember, in the actual data there are 8, not 4). Doing it as I did above I end up with a ton of duplicated lines of code, typos that are hard to track down, and a ton of items in the global environment used for the steps along the way. – Will Hauser Jul 17 '18 at 22:00

2 Answers


I'm sorry, that sounds frustrating. In case you have more to do and did want a more automatic way to do it, here's one approach that might work for you:

First, convert your example code for a single dataframe into a function:

library(text2vec)
library(dplyr)

analyze_vocab <- function(df_) {
  quest1vocab =
    itoken(as.character(df_$Answertext), ids = df_$ID) %>%
    create_vocabulary() %>%
    prune_vocabulary(term_count_min = 1) %>%
    vocab_vectorizer()

  quarter1 = filter(df_, Quarter == 1)
  quarter1 = itoken(as.character(quarter1$Answertext), 
                    ids = quarter1$ID)

  quarter2 = filter(df_, Quarter == 2)
  quarter2 = itoken(as.character(quarter2$Answertext),
                    ids = quarter2$ID)

  q1mat = create_dtm(quarter1, quest1vocab)
  q2mat = create_dtm(quarter2, quest1vocab)

  similarity = psim2(q1mat, q2mat, method = "jaccard", norm = "none")

  b = data.frame(
    ID = names(similarity),
    Similarity = similarity)

  output <- full_join(b, df_)
  return(output)
}

Now, you can split if you want and then use lapply like this: lapply(split(df, df$Question), analyze_vocab). However, you already seem comfortable with piping so you might as well go with that approach:

similarity_df <- df %>% 
  group_by(Question) %>%
  do(analyze_vocab(.))

Output:

> head(similarity_df, 12)
# A tibble: 12 x 5
# Groups:   Question [2]
   ID    Similarity Quarter Question                                  Answertext                                           
   <fct>      <dbl>   <int> <fct>                                     <fct>                                                
 1 Joy        0           1 And another question                      adsfjasljsdaf jkldfjkl                               
 2 Joy        0           2 And another question                      "dsadsj jlijsad jkldf "                              
 3 Paul       0           1 And another question                      adsfj aslj sd afs dfj ksdf                           
 4 Paul       0           2 And another question                      dsadsj jlijsad                                       
 5 Greg       0           1 And another question                      adsfjasljsdaf                                        
 6 Greg       0           2 And another question                      " asddsf asdfasd sdfasfsdf"                          
 7 Joy        1           1 this is the first question that was asked this is joys answer to this question                 
 8 Joy        1           2 this is the first question that was asked this is joys answer to this question                 
 9 Paul       0.429       1 this is the first question that was asked this is Pauls answer to this question                
10 Paul       0.429       2 this is the first question that was asked "Pauls answer is different "                         
11 Greg       0.667       1 this is the first question that was asked this is Gregs answer to this question nearly the same
12 Greg       0.667       2 this is the first question that was asked this is Gregs answer to this question 

The values in similarity match the ones shown in your example endproduct (note that values shown are rounded for tibble display), so it seems to be working as intended.
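As a follow-on, once a table like similarity_df exists, flagging likely duplicates is one more filter away. This is only a sketch: flag_duplicates is a hypothetical helper, and the 0.8 threshold is an illustrative choice, not something from the question.

```r
library(dplyr)

# Hypothetical helper: keep one row per ID/question pair (Quarter == 1 is
# enough, since both quarters carry the same score) and flag pairs whose
# Jaccard similarity meets a chosen threshold
flag_duplicates <- function(similarity_df, threshold = 0.8) {
  similarity_df %>%
    filter(Quarter == 1) %>%
    filter(Similarity >= threshold) %>%
    select(ID, Question, Similarity)
}
```

Tune the threshold to your data; Jaccard on short free-text answers can sit well below 1 even for near-duplicates.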

Luke C
    This works brilliantly. I've been experimenting with writing small functions for each 'step' in the process when working on projects such as this. Thinking more globally as you've done here is where I need to go. Thanks again. – Will Hauser Jul 20 '18 at 15:57

I gave up and did this manually one dataframe at a time. I'm sure there's a simple way to do it as a list but I can't for the life of me figure out how to apply a list of functions (the vocab vectorizers) to the "Answertext" column in the list of dataframes.

As powerful as R is, it lacks a simple for loop that allows text substitution into commands (à la Stata's "foreach"). I get that there is a different workflow involving breaking a dataframe into a list and iterating over that, but for some tasks this complicates matters considerably, requiring complex indexing to refer not just to the list but also to the specific vectors it contains. I also recognize that Stata-like behavior can be achieved using assign and paste0, but this, like much code in R, is clunky and obtuse. Sigh.
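For reference, a plain for loop over question names is possible here without assign() or paste0(). This is only a sketch, reusing the single-question steps and the df from the question above:

```r
library(text2vec)
library(dplyr)

results <- list()
for (q in unique(as.character(df$Question))) {
  sub <- filter(df, Question == q)
  # one shared vocabulary per question, across both quarters
  vectorizer <- itoken(as.character(sub$Answertext), ids = sub$ID) %>%
    create_vocabulary() %>%
    prune_vocabulary(term_count_min = 1) %>%
    vocab_vectorizer()
  q1 <- filter(sub, Quarter == 1)  # assumes the same ID order in both
  q2 <- filter(sub, Quarter == 2)  # quarters; sort by ID first if not
  m1 <- create_dtm(itoken(as.character(q1$Answertext), ids = q1$ID), vectorizer)
  m2 <- create_dtm(itoken(as.character(q2$Answertext), ids = q2$ID), vectorizer)
  results[[q]] <- psim2(m1, m2, method = "jaccard", norm = "none")
}
# results is a named list: one similarity vector per question
```

The list index results[[q]] replaces the text-substituted variable names a Stata foreach would generate.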

Will Hauser