I am working on my bachelor's thesis and have to prepare a corpus for training word embeddings. What I am wondering is whether it is possible to check a tokenized sentence or text for n-grams and then replace those single tokens with the n-gram.
To make it a bit clearer what I mean:
Input
var = ['Hello', 'Sherlock', 'Holmes', 'my', 'name', 'is', 'Mr', '.', 'Watson', '.']
Desired Output
var = ['Hello', 'Sherlock_Holmes', 'my', 'name', 'is', 'Mr_Watson', '.']
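If the n-grams to merge are already known, the replacement itself is a simple scan over the token list. Here is a minimal sketch; merge_ngrams is a hypothetical helper of my own, not a library function:

def merge_ngrams(tokens, ngrams, sep='_'):
    """Greedily replace known n-grams (tuples of tokens) with single joined tokens."""
    out, i = [], 0
    while i < len(tokens):
        for ng in ngrams:
            if tuple(tokens[i:i + len(ng)]) == ng:  # n-gram starts at position i
                out.append(sep.join(ng))
                i += len(ng)
                break
        else:  # no n-gram matched here, keep the single token
            out.append(tokens[i])
            i += 1
    return out

var = ['Hello', 'Sherlock', 'Holmes', 'my', 'name', 'is', 'Mr', '.', 'Watson', '.']
print(merge_ngrams(var, [('Sherlock', 'Holmes'), ('Mr', '.', 'Watson')]))
# ['Hello', 'Sherlock_Holmes', 'my', 'name', 'is', 'Mr_._Watson', '.']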
I know 'Mr. Watson' is not a perfect example here (the '.' ends up inside the merged token), but I am wondering whether this kind of replacement is possible, because training my word2vec model without merging n-grams first does not do the job well enough.
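One option for finding such n-grams automatically is gensim's Phrases model, which learns frequently co-occurring token pairs from corpus statistics and joins them with an underscore, exactly the 'Sherlock_Holmes' form above. A minimal sketch, assuming a recent gensim is installed (min_count and threshold are placeholder values that would need tuning on a real corpus):

from gensim.models.phrases import Phrases, Phraser

corpus = [
    ['Hello', 'Sherlock', 'Holmes', 'my', 'name', 'is', 'Mr', '.', 'Watson', '.'],
    ['Sherlock', 'Holmes', 'and', 'Mr', '.', 'Watson', 'investigate'],
]

# Learn which adjacent token pairs occur often enough to count as one unit.
phrases = Phrases(corpus, min_count=1, threshold=1)  # placeholder hyperparameters
bigram = Phraser(phrases)  # frozen, faster version for applying only

# Applying the model to a token list merges the detected pairs.
print(bigram[corpus[0]])
# frequent pairs such as ('Sherlock', 'Holmes') come back as 'Sherlock_Holmes'

A second Phrases pass over the already-bigrammed corpus would catch trigrams as well.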
import os

import nltk
from nltk.util import bigrams, trigrams

class MySentence:
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                # Tokenize into words, dollar amounts, and other non-space runs
                tokens = nltk.regexp_tokenize(line, pattern=r'\w+|\$[\d\.]+|\S+')
                # Keep only tokens longer than one character (the unigrams)
                tokens = [token for token in tokens if len(token) > 1]
                bi_tokens = list(bigrams(tokens))    # adjacent pairs, unused so far
                tri_tokens = list(trigrams(tokens))  # adjacent triples, unused so far
                # word2vec expects one token list per sentence, so yield the unigrams
                yield tokens

sentences = MySentence(path)  # path: the directory holding the corpus files
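The iterator can then be streamed straight into gensim's Word2Vec. A minimal sketch, assuming gensim 4 or later (older versions call the dimensionality parameter size rather than vector_size) and that path points at the corpus directory:

from gensim.models import Word2Vec

# MySentence yields one token list per line, so gensim can stream it lazily.
model = Word2Vec(
    sentences=MySentence(path),
    vector_size=100,  # embedding dimensionality
    window=5,         # context window size
    min_count=2,      # ignore tokens rarer than this
    workers=4,        # training threads
)
model.save('word2vec.model')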