I wrote code that extracts words from a corpus, tokenizes them, and compares them to sentences. The output is a bag-of-words vector (1 if the word is in the sentence, 0 if not).
from nltk import FreqDist
from nltk.corpus import brown

news = brown.words(categories='news')
news_sents = brown.sents(categories='news')

# build the vocabulary from the 100 most frequent lowercased words
fdist = FreqDist(w.lower() for w in news)
vocabulary = [word for word, _ in fdist.most_common(100)]

# open the file once instead of reopening it for every sentence
with open("D:\\test\\Vector.txt", "a") as f:
    for i, sent in enumerate(news_sents):
        # lowercase the sentence words so they match the lowercased vocabulary
        sent_words = {w.lower() for w in sent}
        features = {word: int(word in sent_words) for word in vocabulary}
        bow = "".join(str(v) for v in features.values())
        print(bow, file=f)
In this case the output string is 100 characters long. I want to split it into chunks of some arbitrary length and assign a chunk number to each chunk. For example:
print(i+1, chunk_id, bow, sep="\t", end="\n", file=f)
where i+1 is the sentence id. To visualize what I mean, let's take two strings of length 12, "110010101111" and "011011000011". The output should look like:
1 1 1100
1 2 0101
1 3 1111
2 1 0110
2 2 1100
2 3 0011
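For reference, one way this might look inside the loop above is the following sketch. The chunk length n = 4 is my assumption here (any length would work), and the chunk ids come from numbering the slices with enumerate:

n = 4  # assumed chunk length, chosen only for illustration

with open("D:\\test\\Vector.txt", "a") as f:
    for i, sent in enumerate(news_sents):
        sent_words = {w.lower() for w in sent}
        bow = "".join(str(int(word in sent_words)) for word in vocabulary)
        # walk the string in steps of n and number the chunks from 1
        for chunk_id, start in enumerate(range(0, len(bow), n), start=1):
            print(i + 1, chunk_id, bow[start:start + n], sep="\t", file=f)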