Word count and cumulative sum
I have a data set of up to 1.5 million rows. The data set is a time series in year format, as shown below. I am trying to count the strings per year in a cumulative format. Example below:
lodgement_year  trademark_text
1906            PEPS
1906            BILE BEANS FOR BILIOUSNESS B
1906            ZAM-BUK Z

After grouping, one joined line of text per year:

lodgement_year
1906    {PEPS BILE BEANS FOR BILIOUSNESS B ZAM-BUK Z Z...
1907    {WHS CHERUB BLACK & WHITE SOUTHERN CROSS HISTO...
As an initial task I grouped the strings, then looped over all the years using code that was posted on this forum by xxx. While the loop works at first, the following message appears straight after:
The code:

from collections import Counter
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# one joined text string per lodgement year
d = df_merge.groupby('lodgement_year')['trademark_text'].apply(lambda x: "{%s}" % ' '.join(x))

for name in d.index:
    data = d.loc[name]
    # count unigram frequencies in this year's text
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize,
                                       ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names())
    counts = X.sum(axis=0).A1
    freq_distribution = Counter(dict(zip(vocab, counts)))
    print(name, freq_distribution.most_common(10))
The error message:
Traceback (most recent call last):
  File "/Users/PycharmProjects/Slice_Time_Series", line 65, in <module>
    X = ngram_vectorizer.fit_transform(data.split('\n'))
  File "/Users/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 3081, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'split'
The output produced before the error:
1906 [('.', 24), ("'s", 22), ('star', 18), ('&', 15), ('kodak', 12), ('co', 9), ('the', 9), ('brand', 8), ('express', 8), ('anchor', 6)]
1907 [('&', 11), ("'s", 11), ('brand', 11), ('pinnacle', 7), ('vaseline', 7), ('the', 6), ('.', 5), ('co.', 5), ('kepler', 5), ('lucas', 5)]
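From the traceback it looks like data is sometimes a whole Series rather than a single string; I gather .loc can return a Series when a label appears more than once in the index. If that is the cause, a minimal guard might be (untested):

import pandas as pd

for name in d.index:
    data = d.loc[name]
    # .loc returns a Series rather than a scalar when the label is duplicated;
    # join it back into a single string before tokenizing
    if isinstance(data, pd.Series):
        data = ' '.join(data)
    # ... rest of the per-year counting as above ...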
Any help will be greatly appreciated. As a next task I'm trying to create a series which is a cumulative sum of the counts: 1906 on its own, then 1906 + 1907, then 1906 + 1907 + 1908, and so on. I have no idea what to do yet, so any guidance would also be great.
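The only rough idea I have so far is to keep a running Counter and snapshot it after each year, sidestepping CountVectorizer and tokenizing with word_tokenize directly. An untested sketch (running and cumulative_by_year are just names I made up):

from collections import Counter
from nltk.tokenize import word_tokenize

running = Counter()
cumulative_by_year = {}
for name in sorted(d.index):
    text = d.loc[name]                       # one joined string per year (see guard above)
    # lower-case to match CountVectorizer's default, then tokenize and count
    running.update(word_tokenize(text.lower()))
    # snapshot the running total: 1906, then 1906+1907, then 1906+1907+1908, ...
    cumulative_by_year[name] = running.copy()
    print(name, cumulative_by_year[name].most_common(10))

Does that look like a sensible direction, or is there a more pandas-native way to do the accumulation?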
Ian