
I found many posts explaining n-grams for words. However, I need to apply n-grams (n = 2 or 3) to a two-dimensional dataframe of integers. For example, consider the 3 x 5 dataframe below:

df =
1, 2, 3, 4, 5
6, 7, 8, 9, 10
11, 12, 13, 14, 15

I need to compute the bigrams and trigrams of each row of df.

I tried this code, but it does not work properly:

for i in range(df.shape[0]):
    row = list(str(df.iloc[i,:]))
    print("row:  ", row)
    bigrams = [b for l in row for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
    print(bigrams)

If the input row is [10, 20, 30, 40, 50, 60, ...], the expected output is:

Bigram

(10,20)(20,30)(30,40)(40,50)...

Trigram

(10,20,30)(20,30,40)(30,40,50)...

Linux
Mohsen Ali
  • Considering your example input, what would be the expected output? – mozway Jul 17 '23 at 09:08
  • For example, for each row of df, if the row is [ 60, 78, 56, 78, 60, ... ], the tri-gram features are: (60, 78, 56), (78, 56, 78), (56, 78, 60), etc.; the bi-gram features are: (60, 78), (78, 56), (56, 78), (78, 60), etc. Note: we do not need the brackets. – Mohsen Ali Jul 17 '23 at 09:12
  • I get the general logic, you need to use `itertools.combinations`, but what about the specifics? Do you need new rows? A list of the n-grams? Please provide the exact input as [DataFrame constructor](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and the same for the output. – mozway Jul 17 '23 at 09:19

1 Answer


Use nltk.ngrams:

from nltk.util import ngrams

# for bigrams
for a in df.values:
    print(list(ngrams(a, n=2)))

[(1, 2), (2, 3), (3, 4), (4, 5)]
[(6, 7), (7, 8), (8, 9), (9, 10)]
[(11, 12), (12, 13), (13, 14), (14, 15)]

For trigrams, set n=3:

[(1, 2, 3), (2, 3, 4), (3, 4, 5)]
[(6, 7, 8), (7, 8, 9), (8, 9, 10)]
[(11, 12, 13), (12, 13, 14), (13, 14, 15)]
RomanPerekhrest
  • Thanks, however, I need just numbers inside a list without parenthesis, e.g., [1, 2, 3, 2, 3, 4, 3, 4, 5] [6, 7, 8, 7, 8, 9, 8, 9, 10] [11, 12, 13, 12, 13, 14, 13, 14, 15] – Mohsen Ali Jul 17 '23 at 13:14
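
The flat output requested in the last comment can be obtained by concatenating each row's n-gram tuples. The following is a minimal sketch, not part of the original answer: it uses only the standard library, the helper names `sliding_ngrams` and `flat_ngrams` are hypothetical, and `rows` is a plain-list stand-in for the question's dataframe.

```python
from itertools import chain

def sliding_ngrams(seq, n):
    """Return the sliding-window n-grams of seq as a list of tuples."""
    seq = list(seq)
    # zip stops at the shortest slice, so exactly len(seq) - n + 1 tuples come out
    return list(zip(*(seq[i:] for i in range(n))))

def flat_ngrams(seq, n):
    """Concatenate the n-grams of seq into one flat list (no tuples)."""
    return list(chain.from_iterable(sliding_ngrams(seq, n)))

# Stand-in for the 3 x 5 dataframe in the question (assumption: plain lists)
rows = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]

for row in rows:
    print(flat_ngrams(row, 3))
# [1, 2, 3, 2, 3, 4, 3, 4, 5]
# [6, 7, 8, 7, 8, 9, 8, 9, 10]
# [11, 12, 13, 12, 13, 14, 13, 14, 15]
```

With a real DataFrame, `df.values.tolist()` should give equivalent rows of plain Python integers. The same flattening also works on the tuples produced by `nltk.util.ngrams`, since `chain.from_iterable` accepts any iterable of tuples.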