How to count n-grams from a column

Question

I am using n-grams for the first time. I took a DataFrame with multiple rows and columns, removed the stop words, and tokenized the text. My code is this:

from nltk.corpus import stopwords
stop = stopwords.words('english')

# Exclude stopwords with Python's list comprehension and pandas.DataFrame

testdf['issues_without_stopwords'] = testdf['issue'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
testdf['questions_without_stopwords'] = testdf['question'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))



# Remove Punctuations and Tokenize
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
testdf['questions_tokenized'] = testdf['question'].apply(lambda x: tokenizer.tokenize(x))
testdf['issue_tokenized'] = testdf['issue'].apply(lambda x: tokenizer.tokenize(x))
testdf["Concate"] = testdf['issue_tokenized']+ testdf['questions_tokenized']


#Create your n-grams (1st method)

def find_ngrams(input_list, n):
  return list(zip(*[input_list[i:] for i in range(n)]))


df1 = testdf["Concate"].apply(lambda x: find_ngrams(x, 4))

from itertools import tee, islice
from collections import Counter

#Create your n-grams and count them in cell (2nd method)
def ngrams(lst, n):
  tlst = lst
  while True:
    a, b = tee(tlst)
    l = tuple(islice(a, n))
    if len(l) == n:
      yield l
      next(b)
      tlst = b
    else:
      break

# Count the 4-grams produced by the generator
df2 = Counter(ngrams(df2["value"], 4))

I was then able to convert them into 4-grams.

This is my raw sample data:

        issue           question
0   Menstrual health    How to get my period back
1   stomach pain        any advise
2   Vaping              I am having a tonsillectomy tomorrow
3   Mental health       Ive been feeling sad most of the time
4   Kidney stone        I was diagnosed with one Saturday at Er

What I want is one column with all the n-grams and another column with each n-gram's frequency, something like this:

N - grams                  Freq

[(n, gram, talha)]          2 

[(talha, software, python)] 1

I also need to merge duplicate n-grams regardless of word order. For example, [(n, gram, talha)] and [(talha, gram, n)] should be shown once with a count of 2 (I just wanted to be clear; I know I said freq before, lol).
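One way to treat word order as insignificant, as described here, is to map every n-gram to a canonical key before counting. The following is a minimal sketch, assuming a sorted tuple is an acceptable canonical form; the helper names `canonical` and `count_unordered` and the sample data are illustrative and do not come from the original code.

```python
from collections import Counter

def canonical(gram):
    """Return an order-independent key for an n-gram (assumption: sorted tuple)."""
    return tuple(sorted(gram))

def count_unordered(grams):
    """Count n-grams, treating reorderings of the same words as one n-gram."""
    return Counter(canonical(g) for g in grams)

# Hypothetical sample mirroring the example in the question
grams = [
    ("n", "gram", "talha"),
    ("talha", "gram", "n"),
    ("talha", "software", "python"),
]
counts = count_unordered(grams)

# Each distinct word combination appears once, with its total frequency
for gram, freq in counts.most_common():
    print(gram, freq)
```

A sorted tuple keeps repeated words inside an n-gram, whereas a `frozenset` key would collapse them; which behavior is wanted depends on the data.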

EDIT: To avoid confusion, this is what I get right now:

Concate
0   [('Menstrual', 'health', 'How', 'to'), ('health', 'How', 'to', 'get'), ('How', 'to', 'get', 'my')]
1   [('stomach', 'pain', 'any', 'advise')]
2   [('Vaping', 'with', 'nicotine', 'before'), ('with', 'nicotine', 'before', 'tonsillectomy')]
3   [('Mental', 'health', 'Ive', 'been'), ('health', 'Ive', 'been', 'feeling'), ('Ive', 'been', 'feeling', 'sad'), ('been', 'feeling', 'sad', 'most'), ('feeling', 'sad', 'most', 'of'), ('sad', 'most', 'of', 'the'), ('most', 'of', 'the', 'time'), ('of', 'the', 'time', 'and')]
4   [('Kidney', 'stone', 'I', 'was'), ('stone', 'I', 'was', 'diagnosed'), ('I', 'was', 'diagnosed', 'with'), ('was', 'diagnosed', 'with', 'one')]
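Since the output above is one list of n-grams per row, a count over the whole column needs those lists flattened first. Below is a rough sketch using only the standard library, assuming each row is a list of tuples like the ones shown; `count_column_ngrams` is a hypothetical helper name, and the third sample row is a made-up duplicate added only so that one count exceeds 1.

```python
from collections import Counter
from itertools import chain

def count_column_ngrams(rows):
    """Count n-grams over all rows; each row is a list of n-gram tuples."""
    return Counter(chain.from_iterable(rows))

# Hypothetical stand-in for the per-row output (any iterable of lists works)
rows = [
    [("stomach", "pain", "any", "advise")],
    [("Kidney", "stone", "I", "was"), ("stone", "I", "was", "diagnosed")],
    [("stomach", "pain", "any", "advise")],
]
counts = count_column_ngrams(rows)

# One (n-gram, frequency) pair per distinct n-gram, most frequent first
table = counts.most_common()
for gram, freq in table:
    print(gram, freq)
```

Because iterating a pandas Series yields its values, a column of lists such as `df1` could presumably be passed in place of `rows`, and the resulting pairs could then be loaded into a two-column DataFrame, for example with `pd.DataFrame(table, columns=["N-grams", "Freq"])`.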

Is that code complete? I don't see `testdf` being defined anywhere. — Nova, Aug 08 '18 at 19:16
Are you sure you want to consider `[(n, gram, talha)]` and `[(talha, gram, n)]` as equal? N-grams are usually defined as *sequences* of words, so order is significant. — Nova, Aug 08 '18 at 19:27
In your output example, shouldn't `[talha, software, python]` be `[(talha, software, python)]`? — Nova, Aug 08 '18 at 19:28
testdf was just me loading the first 5 rows (head) of my data, nothing else. Yes, I want to treat the two as equal because the order isn't very significant right now. — Talha Qadeer, Aug 08 '18 at 19:31
Possible duplicate of [n-grams in python, four, five, six grams?](https://stackoverflow.com/questions/17531684/n-grams-in-python-four-five-six-grams) — Yohanes Gultom, Aug 09 '18 at 05:17
I think my problem is different. I can convert my data into n-grams and count them, but right now this code works row-wise. I'll update my question to show how my result currently looks. — Talha Qadeer, Aug 09 '18 at 09:17
