
I have a set of data in a text file and I would like to build a frequency table based on pre-defined words (drive, street, i, lives). Below is an example:

 ID |  Text
 ---|--------------------------------------------------------------------
 1  | i drive to work everyday in the morning and i drive back in the evening on main street
 2  | i drive back in a car and then drive to the gym on 5th street
 3  | Joe lives in Newyork on NY street
 4  | Tod lives in Jersey city on NJ street

Here is what I would like to get as output:

ID  |  drive |  street  |   i  |  lives
----|--------|----------|------|-------
1   |   2    |    1     |   2  |   0
2   |   2    |    1     |   1  |   0
3   |   0    |    1     |   0  |   1
4   |   0    |    1     |   0  |   1

Here is the code I'm using. It counts the words it finds, but that doesn't solve my need: I would like to count only a set of pre-defined words, as shown above.

   from nltk.corpus import stopwords
   import string
   from collections import Counter
   import nltk
   from nltk.tag import pos_tag

   xy = open(r'C:\Python\data\file.txt').read().split()
   xyz = [w.lower() for w in xy]

   stopset = set(stopwords.words('english'))

   filtered_words = [word for word in xyz if word not in stopset]
   filtered_words = []
   for word in xyz:
       if word not in stopset:
           filtered_words.append(word)
   print(Counter(filtered_words))
   print(len(filtered_words))
RData
    Why do you have both a list comprehension and then the manual version straight after? – Alex Hall Nov 18 '16 at 21:41
  • What output does the code produce? – Peter Wood Nov 18 '16 at 21:45
  • Counter({'street': 4, 'drive': 4, 'back': 2, 'lives': 2, 'main': 1, 'morning': 1, 'nj': 1, '5th': 1, 'tod': 1, 'everyday': 1, 'newyork': 1, 'jersey': 1, 'joe': 1, 'city': 1, 'gym': 1, 'ny': 1, 'car': 1, 'evening': 1, 'work': 1}) – RData Nov 18 '16 at 21:48
  • @AlexHall - did not get what you meant – RData Nov 18 '16 at 21:49

4 Answers


Something like sklearn.feature_extraction.text.CountVectorizer seems to be close to what you're looking for. Also, collections.Counter might be helpful. How are you planning to use this data structure? If you happen to be doing machine learning/prediction, it's worthwhile to look into the different vectorizers in sklearn.feature_extraction.text.

Edit:

text = ['i drive to work everyday in the morning and i drive back in the evening on main street',
        'i drive back in a car and then drive to the gym on 5th street',
        'Joe lives in Newyork on NY street',
        'Tod lives in Jersey city on NJ street']

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vocab = ['drive', 'street', 'i', 'lives']

vectorizer = CountVectorizer(vocabulary = vocab)

# turn the text above into a matrix of shape R X C
# where R is number of rows (elements in your text array)
# and C is the number of elements in the set of all words in your text array
X = vectorizer.fit_transform(text)

# sparse to dense matrix
X = X.toarray()

# get the feature names from the already-fitted vectorizer
vectorizer_feature_names = vectorizer.get_feature_names()

# prove that the vectorizer's feature names are identical to the vocab you specified above
assert vectorizer_feature_names == vocab

# make a table with word frequencies as values and vocab as columns
out_df = pd.DataFrame(data = X, columns = vectorizer_feature_names)

print(out_df)

And, your result:

       drive  street  i  lives
    0      2       1  0      0
    1      2       1  0      0
    2      0       1  0      1
    3      0       1  0      1
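One caveat: with default settings, CountVectorizer's token_pattern (`(?u)\b\w\w+\b`) only matches tokens of two or more word characters, which is why the `i` column above is all zeros even though `i` appears in the text. If single-character words matter, a custom token_pattern keeps them (a sketch using the same texts and vocab as above):

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['i drive to work everyday in the morning and i drive back in the evening on main street',
        'i drive back in a car and then drive to the gym on 5th street',
        'Joe lives in Newyork on NY street',
        'Tod lives in Jersey city on NJ street']
vocab = ['drive', 'street', 'i', 'lives']

# \b\w+\b matches one-or-more word characters, so 'i' is kept as a token
vectorizer = CountVectorizer(vocabulary=vocab, token_pattern=r'(?u)\b\w+\b')
X = vectorizer.fit_transform(text).toarray()
print(X)
# first row is now [2, 1, 2, 0]: two 'drive', one 'street', two 'i', no 'lives'
```

This matches the table in the question, where row 1 has `i` = 2.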
blacksite
  • i'm not sure if i can use predefined words to find the frequency using sklearn.feature_extraction.text. i presently would need to find the frequency only for certain words – RData Nov 18 '16 at 21:50
  • Worked up perfect and thanks for sharing as i did not know how to use pre-defined words in CountVectorizer . Also, there is another newbie doubt that i have - i have made some changes to the above code(remove stopwords,punctuations etc..) and tried to run on a file with 2000 records and when i output to a text file or output using PyCharm, i see few records and then see bunch of blank line ............... and then see last few lines. How can i correct this ? – RData Nov 18 '16 at 22:56
  • If you're talking about the matrix `X`, `numpy` limits how much of an array is printed to save your console from printing thousands and thousands of rows' worth of data. Your data is there; it's just not displayed in that view. [Here](http://stackoverflow.com/questions/1987694/print-the-full-numpy-array) is something worth reading if you're interested in printing the full matrix (although with 2000 records, I wouldn't recommend it!). – blacksite Nov 18 '16 at 23:10

Simply ask for the words you want instead of the stop words you don't want:

filtered_words = [word for word in xyz if word in ['drive', 'street', 'i', 'lives']]
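To get per-row counts from that filtered list, collections.Counter can be applied one line at a time (a sketch; the file input from the question is replaced with an inline list of two rows):

```python
from collections import Counter

words = ['drive', 'street', 'i', 'lives']
rows = ['i drive to work everyday in the morning and i drive back in the evening on main street',
        'Joe lives in Newyork on NY street']

table = []
for row in rows:
    # keep only the pre-defined words, then count them
    counts = Counter(w for w in row.lower().split() if w in words)
    table.append([counts[w] for w in words])  # Counter returns 0 for missing keys
print(table)  # [[2, 1, 2, 0], [0, 1, 0, 1]]
```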
Alex Hall

If you want the number of occurrences of a certain word in a list, you can use list.count(word). So if you have a list of words you want frequencies for, you can do something like this:

wanted_words = ["drive", "street", "i", "lives"]
frequencies = [xy.count(i) for i in wanted_words]
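Note that `xy` in the question holds the whole file, so this gives totals over the entire corpus. For the per-ID table in the question, the same `count` call can be applied line by line instead (a sketch, with two of the question's rows inlined in place of file input):

```python
lines = ['i drive back in a car and then drive to the gym on 5th street',
         'Tod lives in Jersey city on NJ street']
wanted_words = ['drive', 'street', 'i', 'lives']

# count within each line rather than over the whole file
per_line = [[line.lower().split().count(w) for w in wanted_words] for line in lines]
print(per_line)  # [[2, 1, 1, 0], [0, 1, 0, 1]]
```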
masteryoom

Based on Alex Hall's idea to pre-filter: afterwards just use collections.defaultdict, which is very convenient for counting.

from collections import defaultdict
s = 'i drive to work everyday in the morning and i drive back in the evening on main street'
filtered_words = [word for word in s.split() 
                  if word in ['drive', 'street', 'i', 'lives']]
d = defaultdict(int)
for k in filtered_words: 
    d[k] += 1
print(d)
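Extending the same idea to every row gives the full table from the question (a sketch; the four rows are inlined here instead of being read from a file):

```python
from collections import defaultdict

vocab = ['drive', 'street', 'i', 'lives']
rows = ['i drive to work everyday in the morning and i drive back in the evening on main street',
        'i drive back in a car and then drive to the gym on 5th street',
        'Joe lives in Newyork on NY street',
        'Tod lives in Jersey city on NJ street']

table = []
for row in rows:
    d = defaultdict(int)
    for word in row.lower().split():
        if word in vocab:
            d[word] += 1
    table.append([d[w] for w in vocab])  # defaultdict yields 0 for absent words
print(table)  # [[2, 1, 2, 0], [2, 1, 1, 0], [0, 1, 0, 1], [0, 1, 0, 1]]
```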
lmNt