
I am working on my bachelor thesis and have to prepare a corpus to train word embeddings. What I am wondering is whether it is possible to check a tokenized sentence or text for n-grams and then replace those single tokens with the n-gram.

To make it a bit clearer what I mean:

Input

var = ['Hello', 'Sherlock', 'Holmes', 'my', 'name', 'is', 'Mr', '.', 'Watson','.']

Desired Output

var = ['Hello', 'Sherlock_Holmes', 'my', 'name', 'is', 'Mr_Watson','.']

I know 'Mr. Watson' is not the perfect example right now, but I am wondering whether this is possible at all.

Training my word2vec model without looking for n-grams first does not do the job well enough.

import os
import nltk
from nltk.util import bigrams, trigrams

class MySentence():
    def __init__(self, dirname):
        self.dirname = dirname
        print('Hello init')

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                tokens = nltk.regexp_tokenize(line, pattern=r'\w+|\$[\d\.]+|\S+')
                tokens = [token for token in tokens if len(token) > 1]  # same as unigrams
                bi_tokens = list(bigrams(tokens))
                tri_tokens = list(trigrams(tokens))
                yield tokens, bi_tokens, tri_tokens  # unigrams, bigrams and trigrams per line
sentences = MySentence(path)
yannickhau

1 Answer


N-grams are just sequences of adjacent words; they don't have to make sense linguistically. For example, "Hello Sherlock" and "Holmes my" are both valid 2-grams. It sounds like what you are actually looking for is more sophisticated tokenization with language-specific context, or named entity recognition ("Sherlock Holmes"), which itself requires a trained model. Check out NLTK's documentation on nltk.ne_chunk() and rule-based chunking, or, for an out-of-the-box solution, spaCy's named entity recognition and tokenization capabilities, to get started.
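
To illustrate the idea, here is a minimal sketch using nltk.ne_chunk(): tag the tokens, chunk them, and join each recognized entity with underscores. The helper name merge_named_entities is my own, not an NLTK function, and it assumes the usual NLTK tagger/chunker resources (averaged_perceptron_tagger, maxent_ne_chunker, words; exact names depend on your NLTK version) are already downloaded.

import nltk

# Sketch only: merge_named_entities is a custom helper, not part of NLTK.
def merge_named_entities(tokens):
    """Join multi-word named entities into single 'A_B' tokens."""
    tree = nltk.ne_chunk(nltk.pos_tag(tokens))
    merged = []
    for node in tree:
        if isinstance(node, nltk.Tree):    # a recognized entity such as PERSON
            merged.append('_'.join(word for word, tag in node.leaves()))
        else:                              # an ordinary (word, tag) pair
            merged.append(node[0])
    return merged

print(merge_named_entities(['Hello', 'Sherlock', 'Holmes', 'my', 'name', 'is', 'Mr', '.', 'Watson', '.']))
# e.g. ['Hello', 'Sherlock_Holmes', 'my', 'name', 'is', 'Mr', '.', 'Watson', '.']

Note that 'Mr . Watson' will usually not come out as a single entity, as mentioned in the question; spaCy's doc.ents gives you comparable entity spans out of the box if you prefer that route.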

Sharon Choong