
I am trying to create groups of words. First I count all the words, then I establish the top 10 words by word count, and then I want to create 10 groups of words based on those top 10. Each group consists of all the words that appear before and after its top word.

I have survey results stored in a pandas DataFrame structured like this:

Question_ID | Customer_ID | Answer
  1           234         Data is very important to use because ... 
  2           234         We value data since we need it ... 

I also saved the answers column as a string.

I am using the following code to find the 3 words before and after a given word (I first had to convert the Answer column to strings):

import re

answers_str = df.Answer.apply(str)
for value in answers_str:
   non_data = re.split('data|Data', value)
   terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
   substrs = [term.split()[0:3] for term in terms_list]  # slice and grab the first three words
   result = [' '.join(term) for term in substrs]  # combine the words back into substrings
   print(result)

I have been manually creating groups of words - but is there a way of doing it in Python?

So based on the example shown above, the groups with word counts would look like this:

group "data": 
              data : 2
              important: 1
              value: 1
              need: 1

then when it goes through the whole file, there would be another group:

group "analytics":
              analyze: 5
              report: 7
              list: 10
              visualize: 16

The idea would be to get rid of "we", "to" and "is" as well - but I can do that manually, if that's not possible.

Then I want to establish the 10 most-used words (by word count) and create 10 groups containing the words that appear in front of and behind those top 10 words.


1 Answer


We can use a regex for this. We'll be using this regular expression

((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})

which you can test for yourself here, to extract up to three words before and after each occurrence of data.
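As a quick illustration of what the pattern captures (using a sentence in the spirit of the question's first example answer):

```python
import re

pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'

# Each match is a (before, after) tuple holding up to three words on each side
res = re.findall(pat, "Data is very important to use because we store data")
print(res)  # → [('', ' is very important'), ('because we store ', '')]
```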

First, let's remove all the words we don't like from the strings:

import re

# If you're processing a lot of sentences, it's probably wise to preprocess
# the pattern, assuming that bad_words is the same for all sentences
def remove_words(sentence, bad_words):
    # \b on each side so that e.g. removing 'to' doesn't bite into 'store'
    pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, bad_words)))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
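A quick sanity check (repeating the function here, with word boundaries, so the snippet runs on its own):

```python
import re

def remove_words(sentence, bad_words):
    # \b keeps whole-word matches only, re.escape guards against regex metacharacters
    pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, bad_words)))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

out = remove_words("We value data since we need it", ['we', 'is', 'to'])
print(out)  # → " value data since  need it"
```

The leftover double spaces are harmless, since everything is later split on whitespace anyway.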

Then we want to get the words that surround data in each line:

data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
res = re.findall(data_pat, sentence, flags=re.IGNORECASE)  # sentence is one cleaned answer string

which gives us a list of tuples of strings. We want a flat list of the individual words after those strings are split:

from itertools import chain
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))

That's not pretty, but it works. Basically, we pull the tuples out of the list, pull the strings out of each tuple, split each string, and then pull all the resulting words out of the lists they end up in into one big list.
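For reference, since each element of `res` is just a 2-tuple of strings, an equivalent and arguably simpler spelling is a plain comprehension:

```python
# Sample findall output: (before, after) context tuples around "data"
res = [('', ' is very important'), ('because we store ', '')]

# Join each tuple's two context strings, then split everything into words
list_of_words = [w for before, after in res for w in (before + ' ' + after).split()]
print(list_of_words)  # → ['is', 'very', 'important', 'because', 'we', 'store']
```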

Let's put this all together with your pandas code. pandas isn't my strongest area, so if you see something weird-looking, it may well be an elementary mistake on my part.

import re
from itertools import chain
from collections import Counter    

def remove_words(sentence, bad_words):
    # \b on each side so that e.g. removing 'to' doesn't bite into 'store'
    pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, bad_words)))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))
c = Counter()
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
for sentence in sentence_list:
    res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
    words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
    c.update(words)

The nice thing about the regex we're using is that none of the complicated parts care which keyword we're using. With a slight change, we can turn it into a format string

base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'

such that

base_pat.format('data') == data_pat

So, given some list key_words of words we want to collect information about:

import re
from itertools import chain
from collections import Counter    

def remove_words(sentence, bad_words):
    # \b on each side so that e.g. removing 'to' doesn't bite into 'store'
    pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, bad_words)))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)


bad_words = ['we', 'is', 'to']

sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

key_words = ['data', 'analytics']
d = {}

base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
for keyword in key_words:
    key_pat = base_pat.format(keyword)
    c = Counter()
    for sentence in sentence_list:
        res = re.findall(key_pat, sentence, flags=re.IGNORECASE)
        words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
        c.update(words)
    d[keyword] = c

Now we have a dictionary d that maps keywords, like data and analytics, to Counters that map words that are not on our blacklist to their counts in the vicinity of the associated keyword. Something like:

d = {'data'     : Counter({ 'important' : 2,
                            'very'      : 3}),
    'analytics' : Counter({ 'boring'    : 5,
                            'sleep'     : 3})
   }
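If you'd then rather inspect the groups as a DataFrame (one row per keyword/word pair, as asked about in the comments), one possible sketch:

```python
import pandas as pd
from collections import Counter

# Sample result dictionary, shaped like d above
d = {'data'      : Counter({'important': 2, 'very': 3}),
     'analytics' : Counter({'boring': 5, 'sleep': 3})}

# Flatten the nested Counters into (keyword, word, count) rows
rows = [(kw, word, count) for kw, c in d.items() for word, count in c.items()]
groups_df = pd.DataFrame(rows, columns=['keyword', 'word', 'count'])
print(groups_df)
```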

As to how we get the top 10 words, that's basically the thing Counter is best at:

key_words, _ = zip(*Counter(w for sentence in sentence_list for w in sentence.split()).most_common(10))
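On a toy sentence_list this looks like the following (using most_common(2) instead of 10 for brevity):

```python
from collections import Counter

sentence_list = ["data is important", "we value data", "data data everywhere"]

# Count every word across all sentences, then keep the most frequent ones
counts = Counter(w for sentence in sentence_list for w in sentence.split())
key_words, _ = zip(*counts.most_common(2))
print(key_words[0])  # → 'data' (it appears 4 times, more than any other word)
```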
  • thank you very much; would it be easier to use the string I created instead of the pandas dataframe? answers_str = df.Answer.apply(str) – jeangelj Dec 16 '16 at 18:35
  • I get an error message SyntaxError: invalid syntax and an arrow pointing to the s in "words" – jeangelj Dec 16 '16 at 18:36
  • @jeangelj I messed up my copy paste, so there was a missing parenthesis. I edited it in, so it should work now – Patrick Haugh Dec 16 '16 at 18:39
  • thank you, I got an error message that removewords needs 2 arguments, so I added bad_words, then I received an error message that "pat" is not defined; next I got an error message that "s" was not defined, should s be sentence? – jeangelj Dec 16 '16 at 18:49
  • @jeangelj yeah, this is what happens when you prototype code in little pieces. I think I got everything this pass – Patrick Haugh Dec 16 '16 at 18:52
  • thank you very much, I really appreciate your time and help! It worked, but what do I have now? This is a list of tuples of strings, correct? and we named it words. How can I actually read/print it or put it into a new dataframe? again, I really appreciate your help – jeangelj Dec 16 '16 at 19:43
  • `c` is now a `Counter` (a special kind of dictionary), that maps strings (the words) to however many times we saw them. You can do `for a in c: print(a, c[a])` to print the words and their counts, or you can put in a dataframe like this: http://stackoverflow.com/questions/31111032/transform-a-counter-object-into-a-pandas-dataframe – Patrick Haugh Dec 16 '16 at 19:47
  • oh so this gave me the word counts, but not a grouping, correct? – jeangelj Dec 16 '16 at 19:59
  • @jeangelj Added an edit explaining how to find the top ten words, and how to get the groupings for each word. I'm not sure how to get them into a dataframe though. – Patrick Haugh Dec 16 '16 at 20:20
  • this worked! thank you very much, I will figure out how to make a pd df out of it; you are the best; THANK YOU – jeangelj Dec 16 '16 at 20:51
  • http://stackoverflow.com/questions/41192401/python-dictionary-to-pandas-dataframe-with-multiple-columns/41192439#41192439 – jeangelj Dec 16 '16 at 21:30