Understanding another's text-mining function that removes similar strings

Question

I’m trying to replicate the methodology from this article, 538 Post about Most Repetitive Phrases, in which the author mined US presidential debate transcripts to determine the most repetitive phrases for each candidate.

I'm trying to implement this methodology with another dataset in R with the tm package.

Most of the code (GitHub repository) concerns mining the transcripts and assembling counts of each ngram, but I get lost at the prune_substrings() function code below:

def prune_substrings(tfidf_dicts, prune_thru=1000):

    pruned = tfidf_dicts

    for candidate in range(len(candidates)):
        # growing list of n-grams in list form
        so_far = []

        ngrams_sorted = sorted(tfidf_dicts[candidate].items(), key=operator.itemgetter(1), reverse=True)[:prune_thru]
        for ngram in ngrams_sorted:
            # contained in a previous aka 'better' phrase
            for better_ngram in so_far:
                if overlap(list(better_ngram), list(ngram[0])):
                    #print "PRUNING!! "
                    #print list(better_ngram)
                    #print list(ngram[0])

                    pruned[candidate][ngram[0]] = 0
            # not contained, so add to so_far to prevent future subphrases
            else:
                so_far += [list(ngram[0])]

    return pruned

The input of the function, tfidf_dicts, is an array of dictionaries (one for each candidate) with ngrams as keys and tf-idf scores as values. For example, Trump's tf-idf dict begins like this:

trump.tfidf.dict = {'we don't win': 83.2, 'you have to': 72.8, ... }

so the structure of the input is like this:

tfidf_dicts = {trump.tfidf.dict, rubio.tfidf.dict, etc }

MY understanding is that prune_substrings does the following things, but I'm stuck on the else if clause, which is a pythonic thing I don't understand yet.

A. create list : pruned as tfidf_dicts; a list of tfidf dicts for each candidate

B loop through each candidate:

so_far = start an empty list of ngrams gone through so so_far

ngrams_sorted = sorted member's tf-idf dict from smallest to biggest

loop through each ngram in sorted

loop through each better_ngram in so_far

IF overlap b/w (below) == TRUE:

better_ngram (from so_far) and

ngram (from ngrams_sorted)

THEN zero out tf-idf for ngram

ELSE if (WHAT?!?)

add ngram to list, so_far

C. return pruned, i.e. list of unique ngrams sorted in order

Any help at all is much appreciated!

beroe · Answer 1 · 2016-05-03T06:49:39.807

1

Note the indentation in your code... The else is lined up with the second for, not the if. This is a for-else construct, not an if-else.

In that case, the else is being used to initialize the inner loop, because it will be executed when so_far is empty the first time through, and each time the inner loop runs out of items to iterate through...

I am not sure that this is the most efficient way to achieve these comparisons, but conceptually you can get a sense of the flow with this snippet:

s=[]
for j in "ABCD":
   for i in s:
      print i,
   else:
       print "\nelse"
       s.append(j)

Output:

else
A 
else
A B 
else
A B C 
else

I would think that in R there is a much better way to do this than nested loops....

edited May 03 '16 at 06:49

answered May 03 '16 at 06:44

beroe

11,784
5
34
79

1

Your snippet was especially helpful for me. For reference, [This on SO Answer to "Why does python use 'else' after for and while loops?](http://stackoverflow.com/a/9980160/3842610), gives a more in depth explanation. – Anders Swanson May 04 '16 at 01:57
@AndersSwanson in the back of my mind, I wonder if there is an off-by-one error in their script, since in my toy example D is not printed. Is it tested in theirs..? – beroe May 09 '16 at 15:11

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

4 months later but here's my solution. I'm sure there is a more efficient solution, but for my purposes, it worked. The pythonic for-else doesn't translate to R. So the steps are different.

Take top n ngrams.
Create a list, t, where each element of the list is a logical vector of length n that says whether ngram in question overlaps all other ngrams (but fix 1:x to be false automatically)
Cbind together every element of t into a table, t2
Return only elements of t2 row sum is zero set elements 1:n to FALSE (i.e. no overlap)

Ouala!

PrunedList Function

#' GetPrunedList
#' 
#' takes a word freq df with columns Words and LenNorm, returns df of nonoverlapping strings
GetPrunedList <- function(wordfreqdf, prune_thru = 100) {
        #take only first n items in list
        tmp <- head(wordfreqdf, n = prune_thru) %>%
                select(ngrams = Words, tfidfXlength = LenNorm)
        #for each ngram in list:
        t <- (lapply(1:nrow(tmp), function(x) {
                #find overlap between ngram and all items in list (overlap = TRUE)
                idx <- overlap(tmp[x, "ngrams"], tmp$ngrams)
                #set overlap as false for itself and higher-scoring ngrams
                idx[1:x] <- FALSE
                idx
        }))
        
        #bind each ngram's overlap vector together to make a matrix
        t2 <- do.call(cbind, t)   
        
        #find rows(i.e. ngrams) that do not overlap with those below
        idx <- rowSums(t2) == 0
        pruned <- tmp[idx,]
        rownames(pruned) <- NULL
        pruned
}

Overlap function

#' overlap
#' OBJ: takes two ngrams (as strings) and to see if they overlap
#' INPUT: a,b ngrams as strings
#' OUTPUT: TRUE if overlap
overlap <- function(a, b) {
        max_overlap <- min(3, CountWords(a), CountWords(b))
        
        a.beg <- word(a, start = 1L, end = max_overlap)
        a.end <- word(a, start = -max_overlap, end = -1L)
        b.beg <- word(b, start = 1L, end = max_overlap)
        b.end <- word(b, start = -max_overlap, end = -1L)
        
        # b contains a's beginning
        w <- str_detect(b, coll(a.beg, TRUE))
        # b contains a's end
        x <- str_detect(b, coll(a.end, TRUE))
        # a contains b's beginning
        y <- str_detect(a, coll(b.beg, TRUE))
        # a contains b's end
        z <- str_detect(a, coll(b.end, TRUE))
        
        #return TRUE if any of above are true
        (w | x | y | z)
}

Understanding another's text-mining function that removes similar strings

2 Answers2

PrunedList Function

Overlap function