I have a pandas DataFrame with rows of tokenized strings in a column named story, and a list of words called selected_words. I am trying to count how many times any of the selected_words appears in each row of the story column.

The code that previously worked is:

CCwordsCount=df4.story.str.count('|'.join(selected_words))

This is now giving me NaN values for every row.

Below are the first few rows of the story column in df4. The DataFrame contains a little over 400 rows of NYTimes articles.

0      [it, was, a, curious, choice, for, the, good, ...
1      [when, he, was, a, yale, law, school, student,...
2      [video, bitcoin, has, real, world, investors, ...
3      [bitcoin, s, wild, ride, may, not, have, been,...
4      [amid, the, incense, cheap, art, and, herbal, ...
5      [san, francisco, eight, years, ago, ernie, all...

This is the list of selected_words

selected_words = ['accept', 'believe', 'trust', 'accepted', 'accepts', 'trusts', 'believes', \
                  'acceptance', 'trusted', 'trusting', 'accepting', 'believes', 'believing', 'believed',\
                 'normal', 'normalize', ' normalized', 'routine', 'belief', 'faith', 'confidence', 'adoption', \
                  'adopt', 'adopted', 'embrace', 'approve', 'approval', 'approved', 'approves']

Link to my df4 .csv file

  • Is each story entry a list containing a string as in `["it, was, a, curious, choice, for, the, good, ..."]`? – DarrylG May 13 '20 at 15:37
  • Yes I believe that each entry is a list of words. I used .split to separate the sentences into words. The counts need to be associated with each entry because I am correlating the counts with other data from the same dates as the stories. – Jesse-Burton Nicholson May 13 '20 at 15:50

2 Answers

The .find() function (or a plain `in` check, as below) can be useful here, and there are many ways to implement this. If you have no other use for the raw articles and they can be treated as one long string, try the following. You could also put the articles in a dictionary and loop over it.

def find_words(text, words):
    return [word for word in words if word in text]

sentences = "0  [it, was, a, curious, choice, for, the, good, 1      [when, he, was, a, yale, law, school, student, 2      [video, bitcoin, has, real, world, investors, 3      [bitcoin, s, wild, ride, may, not, have, been, 4      [amid, the, incense, cheap, art, and, herbal, 5      [san, francisco, eight, years, ago, ernie, all"

search_keywords=['accept', 'believe', 'trust', 'accepted', 'accepts', 'trusts', 'believes', \
                  'acceptance', 'trusted', 'trusting', 'accepting', 'believes', 'believing', 'believed',\
                 'normal', 'normalize', ' normalized', 'routine', 'belief', 'faith', 'confidence', 'adoption', \
                  'adopt', 'adopted', 'embrace', 'approve', 'approval', 'approved', 'approves', 'good']

found = find_words(sentences, search_keywords)

print(found)

Note: I didn't have a pandas DataFrame in mind when I wrote this.
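Note that find_words does a substring check on one big string and returns which words were found, not a per-story count, so 'accept' would also be found inside a longer word such as 'unacceptable'. If each story is already a list of tokens, a per-story count of exact matches might look like the sketch below. The helper count_selected and the toy stories are made up for illustration and are not from the question's data.

```python
def count_selected(tokens, words):
    """Count how many tokens in one story are exactly one of the selected words."""
    wanted = set(words)  # set lookup keeps each membership check cheap
    return sum(1 for token in tokens if token in wanted)

# Hypothetical toy stories, each already split into tokens
stories = [
    ["it", "was", "a", "good", "choice"],
    ["they", "trust", "and", "accept", "the", "approval"],
]
search_keywords = ["accept", "trust", "approval", "good"]

counts = [count_selected(story, search_keywords) for story in stories]
print(counts)  # [1, 3]
```

With a DataFrame, the same helper could presumably be applied per row, for example df4.story.apply(lambda tokens: count_selected(tokens, selected_words)), assuming the column really holds lists of tokens.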

parlad

Each story entry appears to be a list that was converted to a string.

Use map to strip the surrounding brackets from that string before applying the str accessor, as follows.

CCwordsCount = df4.story.map(lambda x: ''.join(x[1:-1])).str.count('|'.join(selected_words))

print(CCwordsCount.head(20))   # Show first 20 story results

Output

0      1
1      2
2      5
3      7
4      0
5      1
6     10
7      8
8      2
9      2
10     8
11     0
12     0
13     2
14     0
15     4
16     2
17     9
18     0
19     0
Name: story, dtype: int64

Explanation

Each story is a list that was converted to a string, so each entry looks like:

"['it', 'was', 'a', 'curious', 'choice', 'for', 'the', 'good', 'wife', ...]"

The map call drops the leading '[' and trailing ']'. Because x[1:-1] is already a string, ''.join returns it unchanged:

map(lambda x: ''.join(x[1:-1]))

This leaves the quoted words separated by commas. For the first row, the result is the string:

'it', 'was', 'a', 'curious', 'choice', 'for', ...
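The slice above assumes each entry is a string. If the column instead holds real Python lists (as the asker's comment about using .split suggests), x[1:-1] would drop the first and last word rather than the brackets, and .str.count has no string to search in a list entry, which may be why the question reports NaN. A minimal sketch for that case, using a made-up toy DataFrame in place of df4, joins each list with spaces first:

```python
import pandas as pd

# Toy stand-in for df4: each story is a real Python list of tokens (an assumption)
df4 = pd.DataFrame({
    "story": [
        ["we", "trust", "the", "approval", "process"],
        ["bitcoin", "s", "wild", "ride"],
        ["they", "accepted", "it", "in", "good", "faith"],
    ]
})
selected_words = ["accept", "accepted", "trust", "faith", "approval"]

# Join each token list into one space-separated string, then count regex matches
CCwordsCount = df4.story.map(" ".join).str.count("|".join(selected_words))
print(CCwordsCount.tolist())  # [2, 0, 2]
```

As in the original pattern, this counts substring matches, so 'accepted' is counted once via 'accept'. Wrapping the alternation in word boundaries, r'\b(?:...)\b', would restrict it to whole words if that is what is wanted.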
DarrylG
  • @Jesse-BurtonNicholson Check one of your stories. Does it have any of the selected words? – DarrylG May 13 '20 at 15:54
  • Yes, there are over 400 stories and most of them contain at least one of the selected_words – Jesse-Burton Nicholson May 13 '20 at 15:55
  • @Jesse-BurtonNicholson--the abbreviated version of the stories you provided did not have the selected words. Can you give me more of the stories to test with? Also, when I add some words from the stories in your post to the selected words then I get non-zero counts. – DarrylG May 13 '20 at 15:56
  • I have updated the post with a link to the exported .csv file for df4 – Jesse-Burton Nicholson May 13 '20 at 16:07
  • BTW, Thanks so much for your help with this. My code worked at the end of last year and I just came back to it to update the project and it didn't work, it's very frustrating. – Jesse-Burton Nicholson May 13 '20 at 16:11