
I wrote code that extracts words from a corpus, tokenizes them and compares them to sentences. The output is a Bag of Words (1 if the word is in the sentence, 0 if not).

import nltk
import numpy as np
from nltk import FreqDist
from nltk.corpus import brown


news = brown.words(categories='news')
news_sents = brown.sents(categories='news')

# Vocabulary = the 100 most frequent (lowercased) words in the news category.
fdist = FreqDist(w.lower() for w in news)
vocabulary = [word for word, _ in fdist.most_common(100)]
num_sents = len(news_sents)

with open("D:\\test\\Vector.txt", "a") as f:
    for i in range(num_sents):
        # Lowercase the sentence tokens so they match the lowercased vocabulary.
        sent_words = {w.lower() for w in news_sents[i]}
        features = {word: int(word in sent_words) for word in vocabulary}

        # One 100-character string of 0s and 1s per sentence.
        bow = "".join(str(n) for n in features.values())
        print(bow, file=f)

In this case the output string is 100 characters long. I want to split it into chunks of some arbitrary length and assign a chunk number to each chunk. For example:

print(i+1, chunk_id, bow, sep="\t", end="\n", file=f)

where i+1 will be the sentence id. To visualize what I mean, let's take two strings of length 12, "110010101111" and "011011000011". The output should look like:

1 1 1100
1 2 0101
1 3 1111
2 1 0110
2 2 1100
2 3 0011

1 Answer


The grouper function from the itertools documentation seems to be what you're looking for:

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
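
A minimal sketch of how it could plug into your loop, assuming a chunk length of 4 (chunk_len is a name introduced here, use whatever length you need) and that f is your already-open output file:

chunk_len = 4  # arbitrary chunk length; a short final chunk is padded with "0"
for chunk_id, chunk in enumerate(grouper(bow, chunk_len, fillvalue="0"), start=1):
    print(i + 1, chunk_id, "".join(chunk), sep="\t", file=f)

For "110010101111" this writes the lines 1 1 1100, 1 2 0101 and 1 3 1111 (tab-separated), matching your example.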