1

I have a list of sentences and I want to create skipgrams (window size = 3), but I don't want the window to span across sentences, since the sentences are all unrelated.

So, if I have the sentences:

[["my name is John"] , ["This PC is black"]]

the triplets will be:

[my name is]
[name is John]
[This PC is]
[PC is black]

What is the best way to do it?

alvas
oren_isp

3 Answers

2

Here is a simple function to do it.

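def skipgram(corpus, window_size=3):
    sg = []
    for sent in corpus:
        # Each sentence is a one-element list holding the raw string.
        tokens = sent[0].split()
        if len(tokens) <= window_size:
            # The sentence fits in a single window, so keep it whole.
            sg.append(tokens)
        else:
            # Slide the window inside this sentence only.
            for i in range(len(tokens) - window_size + 1):
                sg.append(tokens[i:i + window_size])
    return sg

corpus = [["my name is John"], ["This PC is black"]]
print(skipgram(corpus))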
Ernest S Kirubakaran
1

You don't really want skipgrams per se; you want fixed-size chunks. Try this:

from lazyme import per_chunk

tokens = "my name is John".split()
list(per_chunk(tokens, 2))

[out]:

[('my', 'name'), ('is', 'John')]

If you want a rolling window, i.e. ngrams:

from lazyme import per_window

tokens = "my name is John".split()
list(per_window(tokens, 2))

[out]:

[('my', 'name'), ('name', 'is'), ('is', 'John')]

Similarly in NLTK for ngrams:

from nltk import ngrams

tokens = "my name is John".split()
list(ngrams(tokens, 2))

[out]:

[('my', 'name'), ('name', 'is'), ('is', 'John')]
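
If you would rather avoid a dependency, here is a minimal sketch of the same rolling window using only the standard library, applied one sentence at a time so that no window crosses a sentence boundary. The name `windows_per_sentence` is made up for this illustration, and it assumes each sentence is a plain string that can be split on whitespace:

```python
def windows_per_sentence(sentences, n=3):
    """Yield n-token rolling windows, never spanning two sentences."""
    for sentence in sentences:
        tokens = sentence.split()
        # zip over n staggered views of the token list; zip stops at the
        # shortest view, so sentences with fewer than n tokens yield nothing.
        yield from zip(*(tokens[i:] for i in range(n)))


sentences = ["my name is John", "This PC is black"]
for window in windows_per_sentence(sentences):
    print(window)
```

Because the loop restarts the window for every sentence, the last tokens of one sentence are never paired with the first tokens of the next.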

If you want actual skipgrams, see "How to compute skipgrams in python?":

from nltk import skipgrams

tokens = "my name is John".split()
list(skipgrams(tokens, n=2, k=1))

[out]:

[('my', 'name'),
 ('my', 'is'),
 ('name', 'is'),
 ('name', 'John'),
 ('is', 'John')]
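
To keep skipgrams from spanning unrelated sentences, one option is to generate them per sentence. The following is a rough standard-library sketch, not NLTK's implementation: `skipgrams_per_sentence` is a hypothetical helper, and it assumes the convention shown above, where each skipgram starts at a token and picks its remaining `n - 1` tokens, in order, from the next `n + k - 1` tokens:

```python
from itertools import combinations


def skipgrams_per_sentence(sentences, n=2, k=1):
    """Yield n-token skipgrams with at most k skipped tokens, per sentence."""
    for sentence in sentences:
        tokens = sentence.split()
        for i, head in enumerate(tokens):
            # Candidate followers: the next n + k - 1 tokens of this sentence.
            tail = tokens[i + 1:i + n + k]
            # Choose the remaining n - 1 tokens, preserving their order.
            for rest in combinations(tail, n - 1):
                yield (head,) + rest


sentences = ["my name is John", "This PC is black"]
for gram in skipgrams_per_sentence(sentences, n=2, k=1):
    print(gram)
```

For the first sentence this yields the same five pairs as the output above, and no pair mixes tokens from the two sentences.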
alvas
0

Try this!

from nltk import ngrams

def generate_ngrams(sentences, window_size=3):
    for sentence in sentences:
        yield from ngrams(sentence[0].split(), window_size)

sentences = [["my name is John"], ["This PC is black"]]

for c in generate_ngrams(sentences, 3):
    print(c)

#output:
('my', 'name', 'is')
('name', 'is', 'John')
('This', 'PC', 'is')
('PC', 'is', 'black')
Venkatachalam
  • thanks! Given the trigram, is there a good implementation of the loss_function of word2vec? – oren_isp Dec 26 '18 at 08:16
  • @ai_learning do you know why when I run it, it doesn't print anything out, only gives “DeprecationWarning generator 'ngrams' raised StopIteration” message. I can ignore the message but it still prints nothing out when executing the code – bernando_vialli Jul 09 '19 at 15:50
  • Sounds interesting. Can you add more details and ask this as a separate question? – Venkatachalam Jul 09 '19 at 18:08