
I've used NLTK to pos_tag sentences in a pandas dataframe from an old Yelp competition. This returns a list of tuples (word, POS). I'd like to count the number of parts of speech for each instance. How would I, say, create a function to count the number of being verbs in each review? I know how to apply functions to features - no problem there. I just can't wrap my head around how to count things inside tuples inside lists inside a pd feature.

The head is here, as a tsv: https://pastebin.com/FnnBq9rf
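In case the link dies, the pos_tag column looks roughly like this; a constructed illustration rather than the actual Yelp data (the sentence is one from the linked head):

import pandas as pd
from nltk import pos_tag, word_tokenize

df = pd.DataFrame({"text": ["My wife took me here on my birthday for breakfast."]})
df["pos_tag"] = df["text"].apply(lambda t: pos_tag(word_tokenize(t)))
# each cell now holds a list of (word, POS) tuples,
# e.g. [('My', 'PRP$'), ('wife', 'NN'), ('took', 'VBD'), ...]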

3 Answers


Thank you @zhangyulin for your help. After two days, I learned some incredibly important things (as a novice programmer!). Here's the solution!

def NounCounter(x):
    # collect every word whose tag starts with "NN" (NN, NNS, NNP, NNPS)
    nouns = []
    for (word, pos) in x:
        if pos.startswith("NN"):
            nouns.append(word)
    return nouns

df["nouns"] = df["pos_tag"].apply(NounCounter)
df["noun_count"] = df["nouns"].str.len()

As an example, for a dataframe df, the noun count of the column "reviews" can be saved to a new column "noun_count" with the following code.

from nltk import pos_tag, word_tokenize

def NounCount(x):
    # tokenize the raw review text, tag it, and count the noun tags
    nounCount = sum(1 for word, pos in pos_tag(word_tokenize(x)) if pos.startswith('NN'))
    return nounCount

df["noun_count"] = df["reviews"].apply(NounCount)

df.to_csv('./dataset.csv')
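For a quick sanity check without the full dataset, the same function can be tried on a toy frame; the review text here is just the sample sentence from the question's link, and the printed count is what NLTK's default tagger is expected to produce:

import pandas as pd

sample = pd.DataFrame({"reviews": ["My wife took me here on my birthday for breakfast."]})
sample["noun_count"] = sample["reviews"].apply(NounCount)
print(sample["noun_count"].iloc[0])  # expected 3: "wife", "birthday", "breakfast"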
– Isurie

There are a number of ways to do this. One very straightforward way is to map the list (or pandas Series) of tuples to an indicator of whether the word is a verb, and then count the number of 1's you have.

Assume you have something like this (please correct me if not, since you didn't provide an example):

a = pd.Series([("run", "verb"), ("apple", "noun"), ("play", "verb")])

You can do something like this to map the Series and sum the count:

a.map(lambda x: 1 if x[1] == "verb" else 0).sum()

This will return 2.


I grabbed a sentence from the link you shared:

import nltk

text = nltk.word_tokenize("My wife took me here on my birthday for breakfast and it was excellent.")
tag = nltk.pos_tag(text)
a = pd.Series(tag)
a.map(lambda x: 1 if x[1] == "VBD" else 0).sum()
# this returns 2 ("took" and "was" are both tagged VBD)
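If you want every verb form rather than just the past tense, the same pattern works with a prefix check; this generalization is mine, not part of the original answer:

# count all verb tags (VB, VBD, VBG, VBN, VBP, VBZ) instead of only VBD
a.map(lambda x: 1 if x[1].startswith("VB") else 0).sum()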
– TYZ
  • I was close to this, but it gave me the same error that I get now: "list index out of range." Here's the code I'm using `df["noun_count"] = df["pos_tag"].map(lambda x: 1 if x[1] == "NN" or "NNP" or "NNS" else 0).sum()` – itsbrycehere Feb 22 '18 at 18:57
  • @itsbrycehere If you can post the data you are working with, it will help me identify the problem. – TYZ Feb 22 '18 at 19:15
  • Thanks for your help Yilun, it's above in the link: https://pastebin.com/FnnBq9rf – itsbrycehere Feb 22 '18 at 19:25
  • @itsbrycehere I didn't have that error, can you update your question and show me your `df["pos_tag"]`, just a few of them? Each row should be a tuple with two items. – TYZ Feb 22 '18 at 21:21
  • This is the tsv of the head of df["pos_tag"] https://pastebin.com/NwShUvha – itsbrycehere Feb 22 '18 at 23:14
  • I guess what I'm trying to do is something like this, but to cells in a dataframe. https://stackoverflow.com/questions/33587667/extracting-all-nouns-from-a-text-file-using-nltk – itsbrycehere Feb 23 '18 at 19:20