How to create window/chunk for list of sentences?

Question

I have a list of sentences and I want to create skipgrams (window size = 3), but I don't want the window to span across sentences, since they are all unrelated.

So, if I have the sentences:

[["my name is John"] , ["This PC is black"]]

the triplets will be:

[my name is]
[name is John]
[This PC is]
[PC is black]

What is the best way to do it?

score 2 · Answer 1 · answered Dec 26 '18 at 07:50

Here is a simple function to do it.

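def skipgram(corpus, window_size=3):
    sg = []
    for sent in corpus:
        # Each sentence is a one-element list holding the raw string.
        sent = sent[0].split()
        if len(sent) <= window_size:
            # Too short to slide over: keep the whole sentence as one window.
            sg.append(sent)
        else:
            # Slide the window one token at a time, never past the sentence end.
            for i in range(len(sent) - window_size + 1):
                sg.append(sent[i: i + window_size])
    return sg

corpus = [["my name is John"], ["This PC is black"]]
skipgram(corpus)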

Ernest S Kirubakaran

thank you for your answer. Do you know how you can modify this code to also include a count for each one of the outputs? – bernando_vialli Jul 09 '19 at 15:45
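
One possible way to get a count per window is `collections.Counter`. This is a minimal sketch, not part of the answer above: the function name `count_windows` and the third sentence in the sample corpus are made up for illustration, and the input is assumed to have the same shape as in the question. Unlike `skipgram` above, sentences shorter than the window are skipped here rather than kept whole.

```python
from collections import Counter

def count_windows(corpus, window_size=3):
    """Count how often each window occurs, never crossing sentence boundaries."""
    counts = Counter()
    for sent in corpus:
        tokens = sent[0].split()
        # Tuples are hashable, so they can be Counter keys (lists cannot).
        for i in range(len(tokens) - window_size + 1):
            counts[tuple(tokens[i:i + window_size])] += 1
    return counts

# The third sentence is hypothetical, added only so one window repeats.
corpus = [["my name is John"], ["This PC is black"], ["my name is Ann"]]
counts = count_windows(corpus)
print(counts.most_common(1))  # [(('my', 'name', 'is'), 2)]
```

`counts.items()` gives every window with its count if the full table is needed.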

score 1 · Answer 2 · answered Dec 27 '18 at 04:20

You don't really want skipgrams per se; you want chunks of a fixed size. Try this:

from lazyme import per_chunk

tokens = "my name is John".split()
list(per_chunk(tokens, 2))

[out]:

[('my', 'name'), ('is', 'John')]

If you want a rolling window, i.e. ngrams:

from lazyme import per_window

tokens = "my name is John".split()
list(per_window(tokens, 2))

[out]:

[('my', 'name'), ('name', 'is'), ('is', 'John')]

Similarly in NLTK for ngrams:

from nltk import ngrams

tokens = "my name is John".split()
list(ngrams(tokens, 2))

[out]:

[('my', 'name'), ('name', 'is'), ('is', 'John')]

If you want actual skipgrams (see "How to compute skipgrams in python?"):

from nltk import skipgrams

tokens = "my name is John".split()
list(skipgrams(tokens, n=2, k=1))

[out]:

[('my', 'name'),
 ('my', 'is'),
 ('name', 'is'),
 ('name', 'John'),
 ('is', 'John')]
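
The snippets above work on a single token list. To apply the skipgram idea to a list of sentences without crossing sentence boundaries, something like the following might work. It is a dependency-free sketch using `itertools.combinations`, not NLTK's implementation; the name `sentence_skipgrams` is hypothetical and the corpus shape is assumed to match the question.

```python
from itertools import combinations

def sentence_skipgrams(corpus, n=2, k=1):
    """Yield n-token skipgrams with up to k skipped tokens, one sentence at a time."""
    for sent in corpus:
        tokens = sent[0].split()
        for i, head in enumerate(tokens):
            # Candidates for the remaining n-1 slots: the next n+k-1 tokens.
            tail = tokens[i + 1:i + n + k]
            for rest in combinations(tail, n - 1):
                yield (head,) + rest

corpus = [["my name is John"], ["This PC is black"]]
pairs = list(sentence_skipgrams(corpus, n=2, k=1))
for pair in pairs:
    print(pair)
```

Because each sentence is tokenized and processed separately, no pair mixes tokens from different sentences, e.g. `('John', 'This')` never appears.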

Venkatachalam · Accepted Answer · 2018-12-26T08:12:37.563

0

Try this!

from nltk import ngrams

def generate_ngrams(sentences, window_size=3):
    for sentence in sentences:
        # Build n-grams one sentence at a time so no window crosses a boundary.
        yield from ngrams(sentence[0].split(), window_size)

sentences = [["my name is John"], ["This PC is black"]]

for c in generate_ngrams(sentences, 3):
    print(c)

#output:
('my', 'name', 'is')
('name', 'is', 'John')
('This', 'PC', 'is')
('PC', 'is', 'black')
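
If NLTK is not available, the same per-sentence windows can be sketched with plain `zip`. This is an assumption-laden alternative rather than a drop-in replacement: `sliding_windows` is a made-up name, and the extra sentence "too short" is added only to show that sentences with fewer tokens than the window yield nothing.

```python
def sliding_windows(sentences, window_size=3):
    """Yield token windows per sentence using only built-ins."""
    for sentence in sentences:
        tokens = sentence[0].split()
        # zip stops at the shortest slice, so a sentence with fewer than
        # window_size tokens produces no windows at all.
        yield from zip(*(tokens[i:] for i in range(window_size)))

# "too short" is a hypothetical sentence used to show the short-input case.
sentences = [["my name is John"], ["This PC is black"], ["too short"]]
windows = list(sliding_windows(sentences))
for w in windows:
    print(w)
```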

edited Dec 26 '18 at 08:12

answered Dec 26 '18 at 07:56

Venkatachalam

thanks! Given the trigram, is there a good implementation of the loss_function of word2vec? – oren_isp Dec 26 '18 at 08:16
@ai_learning do you know why when I run it, it doesn't print anything out, only gives “DeprecationWarning generator 'ngrams' raised StopIteration” message. I can ignore the message but it still prints nothing out when executing the code – bernando_vialli Jul 09 '19 at 15:50
Sounds interesting. Can you add more details and ask this as a separate question? – Venkatachalam Jul 09 '19 at 18:08
