
Word Count and cumulative sum

I have a data set of up to 1.5 million rows. The data set is a time series in yearly format, as shown below. I am trying to count the strings per year in a cumulative format. Example below:

lodgement_year                trademark_text
  1906                          PEPS
  1906  BILE BEANS FOR BILIOUSNESS B
  1906                     ZAM-BUK Z

  lodgement_year
  1906    {PEPS BILE BEANS FOR BILIOUSNESS B ZAM-BUK Z Z...

  1907    {WHS CHERUB BLACK & WHITE SOUTHERN CROSS HISTO...

As an initial task I grouped the strings, then applied a loop over all years using code that was posted on this forum by xxx. While the loop works, the following message appears straight after:

    The code:

    from collections import Counter
    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.tokenize import word_tokenize

    d = df_merge.groupby('lodgement_year')['trademark_text'].apply(lambda x: "{%s}" % ' '.join(x))

    for name in d.index:
        data = d.loc[name]
        ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
        X = ngram_vectorizer.fit_transform(data.split('\n'))
        vocab = list(ngram_vectorizer.get_feature_names())
        counts = X.sum(axis=0).A1
        freq_distribution = Counter(dict(zip(vocab, counts)))
        print(name, freq_distribution.most_common(10))

The error message:

Traceback (most recent call last):
  File "/Users/PycharmProjects/Slice_Time_Series", line 65, in <module>
    X = ngram_vectorizer.fit_transform(data.split('\n'))
  File "/Users/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 3081, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'split'

The output produced before the error:

 1906 [('.', 24), ("'s", 22), ('star', 18), ('&', 15), ('kodak', 12), ('co', 9), ('the', 9), ('brand', 8), ('express', 8), ('anchor', 6)]
 1907 [('&', 11), ("'s", 11), ('brand', 11), ('pinnacle', 7), ('vaseline', 7), ('the', 6), ('.', 5), ('co.', 5), ('kepler', 5), ('lucas', 5)]

Any help will be greatly appreciated. As a next task I'm trying to create a series which is a cumulative sum: 1906, then 1906 + 1907, then 1906 + 1907 + 1908, and so on. I have no idea what to do yet, so any guidance would also be great.

Ian

1 Answer


You can first group your DataFrame by the lodgement variable, then iterate over the name-group pairs, count the words in each group, and finally collect the results into a new DataFrame.

import pandas as pd
from collections import Counter
df = pd.DataFrame({'lodg': [1, 2, 1, 3, 1, 2, 2, 3, 1, 1],
                   'text': ['x y z', 'y y', 'x y', 'z x', 'y x', 'y y z', 'x z', 'x x', 'x x', 'y z']})
grouped = df.groupby('lodg')
joined = []
for name, group in grouped:
    texts = " ".join(group['text'])
    groupCounts = Counter(texts.split(" ")).items()
    joined.append([name, [texts], groupCounts])
groupedJoined = pd.DataFrame(joined, columns = ['lodg','texts','textCounts'])

Result:

In [16]: groupedJoined
Out[16]: 
   lodg                    texts                textCounts
0     1  [x y z x y y x x x y z]  [(y, 4), (x, 5), (z, 2)]
1     2          [y y y y z x z]  [(y, 4), (x, 1), (z, 2)]
2     3                [z x x x]          [(x, 3), (z, 1)]
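
With your actual column names, the same loop would look roughly like this (just a sketch, I have not run it against your real data):

grouped = df_merge.groupby('lodgement_year')
joined = []
for name, group in grouped:
    # join all trademark texts for the year, then count the individual words
    texts = " ".join(group['trademark_text'])
    joined.append([name, [texts], Counter(texts.split(" ")).items()])
result = pd.DataFrame(joined, columns=['lodgement_year', 'texts', 'textCounts'])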

After the explicit solution, it took a little while to come up with the one-liner lambda solution.

Two Lambdas:

df.groupby('lodg')['text'].apply(lambda x: "%s" % ' '.join(x)).apply(lambda x: Counter(x.split(" ")).items())

Single Lambda:

df.groupby('lodg')['text'].apply(lambda x: Counter((' '.join(x)).split(" ")).items())

Both produce the same result:

Out[62]: 
lodg
1    [(y, 4), (x, 5), (z, 2)]
2    [(y, 4), (x, 1), (z, 2)]
3            [(x, 3), (z, 1)]
Name: text, dtype: object

Now, if you don't want to deal with the unnamed column (which is actually the index), assign the result to res and call res.reset_index() to get this:

In [68]: res.reset_index()
Out[68]: 
   lodg                      text
0     1  [(y, 4), (x, 5), (z, 2)]
1     2  [(y, 4), (x, 1), (z, 2)]
2     3          [(x, 3), (z, 1)]

This should be much more straightforward to work with.
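
As for the cumulative part of your question (1906, then 1906 + 1907, then 1906 + 1907 + 1908, and so on), here is a rough sketch of one possible direction, using the toy df and imports from above rather than your real data: keep a running Counter and add each group's counts to it in order.

running = Counter()
cumulative = {}
for name, group in df.groupby('lodg'):       # groups come back in sorted key order
    running = running + Counter(" ".join(group['text']).split(" "))
    cumulative[name] = running               # 1, then 1+2, then 1+2+3, ...

cumulativeCounts = pd.Series(cumulative)     # a Series of Counters, one per cumulative step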

  • Hi Fatih, and thanks for providing me some guidance. The code works, however I need to count every single individual word. For example, for BILE BEANS FOR BILIOUSNESS B, the result should be BILE 1, BEANS 1, ... not BILE BEANS FOR BILIOUSNESS B 1. – Ian_De_Oliveira Dec 26 '17 at 23:35
  • If I group by using a lambda to break down all the words, then I get the error message "too many values to unpack": grouped = df.groupby('lodgement_year')['trademark_text'].apply(lambda x: "{%s}" % ' '.join(x)) joined = [] for name, group in grouped: groupCounts = Counter(list(group['trademark_text'])).items() joined.append([name, [" ".join(group['trademark_text'])], groupCounts]) groupedJoined = pd.DataFrame(joined, columns = ['lodgement_year','texts','textCounts']) – Ian_De_Oliveira Dec 26 '17 at 23:43
  • @Ian_De_Oliveira please see the updated answer. I further broke down the texts such as BILE BEANS FOR BILIOUSNESS B into BILE - BEANS - FOR - BILIOUSNESS - B. In my analogy you have texts like 'x y z' and 'y y' for every lodgement_year, and my updated answer counts each and every x, y, and z, instead of counting the 'x y z' and 'y y' strings. I couldn't think of an easy solution using lambda. Lambda notation may not always be easy, but if someone else proposes a lambda-based solution, I will enjoy learning it :) – FatihAkici Dec 27 '17 at 03:00
  • @Ian_De_Oliveira I figured out the lambdas as well. Please see the updated answer. Let me know if you need some more improvement/clarification. – FatihAkici Dec 27 '17 at 03:23
  • Wow, all the complexity I was trying to go through worked in one line of code, genius. Sorry for my question, I'm really a beginner in the coding world. How would you iterate to create a new cumulative observation, for example 1906 + 1907, then 1906 + 1907 + 1908, and so on? I'm trying to create a cumulative sum of the unique words, then take an average by the number of observations or trademark applications yearly. – Ian_De_Oliveira Dec 27 '17 at 07:16
  • @FatihAkici Thanks for your awesome code; I've been trying to sort the results, however it keeps failing. Do you know where I should look for more info on groupby functions? Thanks a lot again. – Ian_De_Oliveira Dec 27 '17 at 08:00
  • How would you unpack in this case and send it to a CSV or Excel file, like below: Words 1906 1907 1908 / peps 1 0 6 / bile 1 0 4 / beans 1 5 1 / for 1 7 3 / whs 0 2 4 / black 0 1 4 – Ian_De_Oliveira Dec 27 '17 at 09:01
  • @Ian_De_Oliveira I can definitely help with that also but it would be better if you asked that part as a new question. You can basically summarize what we've achieved in this thread, share the code so people can reproduce, and then ask the second part of your question. I'll be more than happy to help! – FatihAkici Dec 28 '17 at 02:50
  • @FatihAkici Thanks for your advice, I will post it as a new question :) – Ian_De_Oliveira Dec 30 '17 at 03:28
  • You are welcome, @Ian_De_Oliveira I hope I could help. Will be glad to help more. Good luck! – FatihAkici Dec 30 '17 at 03:31
  • https://stackoverflow.com/questions/48033351/cumulative-sum-using-2-columns – Ian_De_Oliveira Dec 30 '17 at 11:11