
I want to prevent certain phrases from creeping into my models. For example, I want to prevent 'red roses' from entering my analysis. I understand how to add individual stop words, as described in Adding words to scikit-learn's CountVectorizer's stop list, by doing:

from sklearn.feature_extraction import text
additional_stop_words = ['red', 'roses']

However, this also prevents other ngrams like 'red tulips' or 'blue roses' from being detected, because the individual tokens 'red' and 'roses' are stripped out before the ngrams are built.
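
To illustrate the problem, here is a minimal sketch (using CountVectorizer and bigrams only, for brevity):

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(2, 2), stop_words=['red', 'roses'])
cv.fit(["I like red tulips and blue roses"])

# 'red tulips' and 'blue roses' never reach the vocabulary, because the
# tokens 'red' and 'roses' are removed before the bigrams are formed:
print(cv.get_feature_names_out())  # ['and blue' 'like tulips' 'tulips and']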

I am building a TfidfVectorizer as part of my model, and I realize the processing I need might have to happen after this stage, but I am not sure how to do it.

My eventual aim is to do topic modelling on a piece of text. Here is the code (borrowed almost directly from https://de.dariah.eu/tatom/topic_model_python.html#index-0) that I am working on:

import numpy as np

from sklearn import decomposition
from sklearn.feature_extraction import text

additional_stop_words = ['red', 'roses']

sw = text.ENGLISH_STOP_WORDS.union(additional_stop_words)
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2,3),
    stop_words=sw,
    norm='l2',
    min_df=5
)

dtm = mod_vectorizer.fit_transform(df[col]).toarray()
vocab = np.array(mod_vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
num_topics = 5
num_top_words = 5
m_clf = decomposition.LatentDirichletAllocation(
    n_components=num_topics,  # named n_topics in older scikit-learn versions
    random_state=1
)

doctopic = m_clf.fit_transform(dtm)
topic_words = []

for topic in m_clf.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
for t in range(len(topic_words)):
    print("Topic {}: {}".format(t, ','.join(topic_words[t][:5])))

EDIT

Sample dataframe (I have tried to insert as many edge cases as possible), df:

   Content
0  I like red roses as much as I like blue tulips.
1  It would be quite unusual to see red tulips, but not RED ROSES
2  It is almost impossible to find blue roses
3  I like most red flowers, but roses are my favorite.
4  Could you buy me some red roses?
5  John loves the color red. Roses are Mary's favorite flowers.
Melsauce

4 Answers


TfidfVectorizer allows for a custom preprocessor. You can use this to make any needed adjustments.

For example, to remove all occurrences of consecutive "red" + "roses" tokens from your example corpus (case-insensitive), use:

import re
import numpy as np
from sklearn.feature_extraction import text

cases = ["I like red roses as much as I like blue tulips.",
         "It would be quite unusual to see red tulips, but not RED ROSES",
         "It is almost impossible to find blue roses",
         "I like most red flowers, but roses are my favorite.",
         "Could you buy me some red roses?",
         "John loves the color red. Roses are Mary's favorite flowers."]

# remove_stop_phrases() is our custom preprocessing function.
def remove_stop_phrases(doc):
    # note: this regex considers "... red. Roses..." as fair game for removal.
    #       if that's not what you want, just use ["red roses"] instead.
    stop_phrases = [r"red(\s?\.?\s?)roses"]
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    return doc

sw = text.ENGLISH_STOP_WORDS
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2,3),
    stop_words=sw,
    norm='l2',
    min_df=1,
    preprocessor=remove_stop_phrases  # define our custom preprocessor
)

dtm = mod_vectorizer.fit_transform(cases).toarray()
vocab = np.array(mod_vectorizer.get_feature_names_out())

Now vocab has all 'red roses' references removed.

print(sorted(vocab))

['Could buy',
 'It impossible',
 'It impossible blue',
 'It quite',
 'It quite unusual',
 'John loves',
 'John loves color',
 'Mary favorite',
 'Mary favorite flowers',
 'blue roses',
 'blue tulips',
 'color Mary',
 'color Mary favorite',
 'favorite flowers',
 'flowers roses',
 'flowers roses favorite',
 'impossible blue',
 'impossible blue roses',
 'like blue',
 'like blue tulips',
 'like like',
 'like like blue',
 'like red',
 'like red flowers',
 'loves color',
 'loves color Mary',
 'quite unusual',
 'quite unusual red',
 'red flowers',
 'red flowers roses',
 'red tulips',
 'roses favorite',
 'unusual red',
 'unusual red tulips']

UPDATE (per comment thread):

To pass in desired stop phrases along with custom stop words to a wrapper function, use:

desired_stop_phrases = [r"red(\s?\.?\s?)roses"]
desired_stop_words = ['Could', 'buy']

def wrapper(stop_words, stop_phrases):

    def remove_stop_phrases(doc):
        for phrase in stop_phrases:
            doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
        return doc

    sw = text.ENGLISH_STOP_WORDS.union(stop_words)
    mod_vectorizer = text.TfidfVectorizer(
        ngram_range=(2,3),
        stop_words=sw,
        norm='l2',
        min_df=1,
        preprocessor=remove_stop_phrases
    )

    dtm = mod_vectorizer.fit_transform(cases).toarray()
    vocab = np.array(mod_vectorizer.get_feature_names_out())

    return vocab

wrapper(desired_stop_words, desired_stop_phrases)
andrew_reece
  • What would the changes to the code (in the list stop_phrases) be for multiple stop phrases? – Melsauce Aug 09 '17 at 02:25
  • Just add to the `stop_phrases` list. The function loops over every phrase and removes it from the corpus. Like: `["red roses", "blue tulips"]` – andrew_reece Aug 09 '17 at 02:40
  • Thank you! Is there a way I can pass the stop words list as an argument to remove_stop_phrases() function? My use case requires me to do all of the above processing within a larger function to which stop phrases would be input as per need. – Melsauce Aug 09 '17 at 09:08
  • The preprocessor in `TfidfVectorizer` doesn't accept additional arguments. One option is to pass in the custom stop phrase list to your wrapper function, and then have `remove_stop_phrases` refer to the wrapper argument. I've added an update to my answer to demonstrate. – andrew_reece Aug 09 '17 at 15:09
  • It works great, except I'm facing two issues: 1) It seems to neglect the stop words that were passed. Sample the following: – Melsauce Aug 10 '17 at 12:01
  • `swords=['Could', 'buy'] sw=text.ENGLISH_STOP_WORDS.union(swords) stop_phrases=['red roses']` Output of running the code: Topic 0: Could buy, It quite unusual, unusual red tulips, John loves color, color red; Topic 1: impossible blue roses, It impossible blue, blue roses, impossible blue, It impossible; Topic 2: red flowers, like red, red flowers roses, roses favorite, like red flowers; Topic 3: quite unusual, It quite, unusual red tulips, unusual red, quite unusual red; Topic 4: like like, like blue, like like blue, blue tulips, like blue tulips. It still uses the stop words 'Could' and 'buy'. – Melsauce Aug 10 '17 at 12:03
  • 2) Is there a way to make this more efficient? It throws a memory error for moderate sized dataframes. Thank you for all your help! – Melsauce Aug 10 '17 at 12:04
  • Re (1), you can pass in `swords` just like you pass in `stop_phrases`. I've made a modification to my answer showing this. Re (2), that's surprising - not sure what's going on, as it should be able to handle even larger dataframes. Might be worth posting a separate question about efficiency improvements. – andrew_reece Aug 11 '17 at 04:13
  • @andrew_reece Is there any possibility that I can only allow ngrams with stop words between them, NOT at the start and end, using some other preprocessor or this one? – Shan Khan Apr 09 '18 at 21:07
  • Why use a preprocessor to remove stop phrases? The better way is to remove all stop phrases from the whole text at the beginning. Using a preprocessor causes performance issues. – TomSawyer Feb 19 '20 at 10:10

You can switch out the tokenizer of the TfidfVectorizer by passing the keyword argument tokenizer (doc-src).

The original looks like this:

def build_tokenizer(self):
    """Return a function that splits a string into a sequence of tokens"""
    if self.tokenizer is not None:
        return self.tokenizer
    token_pattern = re.compile(self.token_pattern)
    return lambda doc: token_pattern.findall(doc)

So let's make a function that removes all the word combinations you don't want. First let's define the expressions you don't want:

unwanted_expressions = [('red','roses'), ('foo', 'bar')]

and the function would need to look something like this:

import re

token_pattern_str = r"(?u)\b\w\w+\b"

def my_tokenizer(doc):
    """Split a string into a sequence of tokens
    and remove some word combinations along the way."""
    token_pattern = re.compile(token_pattern_str)
    tokens = token_pattern.findall(doc)
    for i in range(len(tokens)):
        for expr in unwanted_expressions:
            if i + len(expr) > len(tokens):
                continue  # expression would run past the end of the token list
            if all(tokens[i + j] == word for j, word in enumerate(expr)):
                tokens[i:i + len(expr)] = len(expr) * [None]
    tokens = [x for x in tokens if x is not None]
    return tokens

I have not tried this specific function out myself, but I have switched out the tokenizer before, and it works well.
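
Hooking it in would look something like this (a sketch, untested, using the Content column from the sample df in the question; note that you pass the function object itself, not the result of calling it):

mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2, 3),
    tokenizer=my_tokenizer,  # the function itself, not my_tokenizer(...)
    norm='l2',
    min_df=1
)
dtm = mod_vectorizer.fit_transform(df['Content']).toarray()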

Good luck :)

Philip Stark
  • Thanks @Philip Stark. Do I basically just give the argument `tokenizer=my_tokenizer(df['content'])` when I call the TfidfVectorizer? I have edited my post to provide a sample df with the content column. – Melsauce Aug 07 '17 at 01:57
  • Actually, you just give tokenizer=my_tokenizer. Don't call it yet. It's a function object. The Vectorizer will call it when it's appropriate. See my link to the original code to understand what it does exactly. – Philip Stark Aug 07 '17 at 10:48
  • It's working well for the test dataframe, however I'm getting the 'IndexError: list index out of range' error for a different dataframe. – Melsauce Aug 10 '17 at 11:31

Before passing df to mod_vectorizer, you should use something like the following:

df=["I like red roses as much as I like blue tulips.",
"It would be quite unusual to see red tulips, but not RED ROSES",
"It is almost impossible to find blue roses",
"I like most red flowers, but roses are my favorite.",
"Could you buy me some red roses?",
"John loves the color red. Roses are Mary's favorite flowers."]

df=[ i.lower() for i in df]
df=[i if 'red roses' not in i else i.replace('red roses','') for i in df]

If you are checking for more than "red roses" then replace the last line in the above with:

stop_phrases = ['red roses']

def filterPhrase(data, stop_phrases):
    for i in range(len(data)):
        for x in stop_phrases:
            if x in data[i]:
                data[i] = data[i].replace(x, '')
    return data

df = filterPhrase(df, stop_phrases)
Juli

For Pandas, you want to use a list comprehension:

.apply(lambda x: [item for item in x if item not in stop])
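
In context, that might look something like the following (a sketch; it assumes each row has been split into tokens and that stop is your set of unwanted words):

import pandas as pd

stop = {'red', 'roses'}
df = pd.DataFrame({'Content': ["I like red roses as much as I like blue tulips."]})

# split each row into tokens, then keep only the tokens not in `stop`
df['Tokens'] = (df['Content']
                .str.lower()
                .str.split()
                .apply(lambda x: [item for item in x if item not in stop]))
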
liam