
I'm trying to count bigrams regardless of word order, so that 'John Doe' and 'Doe John' are counted together as 2.

I have already tried some text-mining examples, such as those at https://www.oreilly.com/library/view/text-mining-with/9781491981641/ch04.html, but couldn't find a way of counting that ignores the order of appearance.

library('dplyr')   # provides %>%
library('widyr')
# austen_section_words is built earlier in the linked chapter
word_pairs <- austen_section_words %>%
  pairwise_count(word, section, sort = TRUE)
word_pairs

It counts each order separately, like this:

   item1     item2     n
   <chr>     <chr>     <dbl>
 1 darcy     elizabeth 144
 2 elizabeth darcy     144

It should look like this:

   item1     item2     n
   <chr>     <chr>     <dbl>
 1 darcy     elizabeth 288

Thanks if anyone can help me.

OTStats
  • Do the normal pairwise counting, then sort the items alphabetically in each row, and end with a grouped sum. – Gregor Thomas Aug 27 '19 at 20:28
  • Take a look at https://stackoverflow.com/questions/22756392/deleting-reversed-duplicates-with-r – tmfmnk Aug 27 '19 at 20:33
  • `pairwise_count` does something completely different from counting bigrams. It counts words that appear in the same section, so it returns both "darcy elizabeth" and "elizabeth darcy" from your example, and likewise "elizabeth miss" and "miss elizabeth". The two orderings always have the same count, as you can see if you look at all the data in the word_pairs table. Working with bigrams is explained at the top of the chapter. Once you remove the bigrams that contain stop words, the chance that two bigrams are the same apart from word order is very slim. – phiver Aug 28 '19 at 09:45
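
The steps in the first comment (pairwise count, sort the items within each row, grouped sum) might look like the sketch below. It assumes dplyr is available; the small `word_pairs` table is a hand-built stand-in for the `pairwise_count()` output, using the two rows from the question, and the column names `a` and `b` are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the pairwise_count() output shown in the question
word_pairs <- tibble(
  item1 = c("darcy", "elizabeth"),
  item2 = c("elizabeth", "darcy"),
  n     = c(144, 144)
)

unordered_pairs <- word_pairs %>%
  # Put the alphabetically smaller word in a and the larger in b,
  # so both orderings of a pair end up in the same row group
  mutate(a = pmin(item1, item2),
         b = pmax(item1, item2)) %>%
  # Grouped sum of the existing counts
  count(a, b, wt = n, name = "n", sort = TRUE)

unordered_pairs
```

With this input the result is a single darcy/elizabeth row with n = 288, the total asked for in the question. Since `pairwise_count` gives both orderings the same count, that total is simply twice the count of either ordering.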

2 Answers


This code works. There is probably something more efficient out there though.

# Create a sample data frame
df <- data.frame(name = c('darcy elizabeth', 'elizabeth darcy', 'John Doe', 'Doe John', 'Steve Smith'),
                 stringsAsFactors = FALSE)

# Split each name into its first and second word
library(stringr)
df$first <- word(df$name, 1)
df$second <- word(df$name, 2)

# Reorder alphabetically so that both orderings map to the same (a, b) pair
df$a <- pmin(df$first, df$second)
df$b <- pmax(df$first, df$second)

# Count the rows in each (a, b) group
library(dplyr)
summarize(group_by(df, a, b), n())

# Yields
#  a     b         `n()`
#  <chr> <chr>     <int>
#1 darcy elizabeth     2
#2 Doe   John          2
#3 Smith Steve         1
Monk

Thanks, guys.

I considered your suggestions and tried a similar approach:

library(dplyr)

# Returns TRUE when x sorts before y.
# I got this function from another post; I can't remember the author.
alphabetical <- function(x, y) {x < y}

# Create a sample data frame
col1 <- c("darcy", "elizabeth", "elizabeth", "darcy", "john", "doe")
col2 <- c("elizabeth", "darcy", "darcy", "elizabeth", "doe", "john")

dfSample <- data.frame(col1, col2, stringsAsFactors = FALSE)

# Create an empty data frame to collect the reordered rows
dfCreated <- data.frame(col1 = character(), col2 = character(), stringsAsFactors = FALSE)

# For each row, put the two words in alphabetical order and append the row to the new data frame
# Thanks to Gregor
for (i in seq_len(nrow(dfSample))) {

  # as.character() replaces as.String(), which needs the NLP package
  row <- c(as.character(dfSample[i, 1]), as.character(dfSample[i, 2]))

  if (!alphabetical(row[1], row[2])) {
    row <- c(row[2], row[1])
  }

  dfCreated <- rbind(dfCreated,
                     data.frame(col1 = row[1], col2 = row[2], stringsAsFactors = FALSE))

}

dfCreated

# Thanks to Monk
summarize(group_by(dfCreated, col1, col2), n())

  col1  col2      `n()`
  <chr> <chr>     <int>
1 darcy elizabeth     4
2 doe   john          2
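
For actual adjacent-word bigrams, as opposed to the same-section pairs that `pairwise_count` returns, the same sort-then-count idea can be applied directly to text. The following is a minimal base R sketch, not a tested replacement for the tidytext workflow in the linked chapter. The function name `count_unordered_bigrams` and the sample sentence are made up for illustration. It lowercases everything, treats any run of non-letters as a separator, and ignores sentence boundaries.

```r
# Hypothetical helper: count adjacent word pairs, ignoring the order within each pair
count_unordered_bigrams <- function(text) {
  # Lowercase and split on anything that is not a letter (a simplifying assumption)
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) {
    return(data.frame(a = character(), b = character(), n = integer(),
                      stringsAsFactors = FALSE))
  }

  # Each word paired with the word that follows it
  first  <- head(words, -1)
  second <- tail(words, -1)

  # Sort within each pair so "john doe" and "doe john" share one key
  pairs <- data.frame(a = pmin(first, second),
                      b = pmax(first, second),
                      n = 1L,
                      stringsAsFactors = FALSE)

  # Grouped sum, most frequent pairs first
  counts <- aggregate(n ~ a + b, data = pairs, FUN = sum)
  counts[order(-counts$n, counts$a, counts$b), ]
}

result <- count_unordered_bigrams("John Doe called. Doe John answered.")
result
```

In this sample, the pair doe/john is counted twice, once for each order. Stop-word removal, which the third comment recommends before comparing bigrams, is left out of the sketch.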