I have a list of the 10 most frequent words in the abstracts of academic articles. I want to count how many times those words occur in each observation of my dataset.

The top 10 words are:

top10 = ['model','language','models','task', 'data', 'paper', 'results', 'information', 'text','performance']

For example, the first 3 observations are:

column[0:3] = ['The models are showing a great performance.',
'The information and therefor the data in the text are good enough to fulfill the task.',
'Data in this way results in the best information and thus performance.']

The code should return, for each observation, the total number of occurrences of the top 10 words. I tried the following code, but it raised an error: count() takes at most 3 arguments (10 given).

My code:

count = 0
for sentence in column:
    for word in sentence.split():
        count += word.lower().count('model','language','models','task', 'data', 'paper', 'results', 'information', 'text','performance')

I also want to lowercase all words and remove the punctuation. So the output should look like this:

output = (2, 4, 4)

The first observation contains 2 words from the top10 list, namely models and performance.

The second observation contains 4 words from the top10 list, namely information, data, text and task.

The third observation contains 4 words from the top10 list, namely data, results, information and performance.

Hopefully you can help me out!

1 Answer

You can use a regex to split each sentence into words (which also drops the punctuation) and then check whether each lowercased word is in top10.

import re

count = []
for sentence in column:
    c = 0
    # \w+ matches runs of word characters, so punctuation is skipped
    for word in re.findall(r'\w+', sentence):
        c += int(word.lower() in top10)
    count.append(c)

count = [2, 4, 4]
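
As a rough sketch of how the same idea might scale to around 10,000 observations, the loop can be wrapped in a small function and applied to every sentence. The names `count_top_words` and `top10_set` are made up for this illustration, and converting the list to a set is only an assumption about what helps performance:

```python
import re

top10 = ['model', 'language', 'models', 'task', 'data',
         'paper', 'results', 'information', 'text', 'performance']
top10_set = set(top10)  # hypothetical helper: set lookups are fast


def count_top_words(sentence, vocabulary=top10_set):
    """Count the words in `sentence` that appear in `vocabulary`."""
    # lower() makes the match case-insensitive; \w+ skips punctuation
    words = re.findall(r'\w+', sentence.lower())
    return sum(word in vocabulary for word in words)


column = ['The models are showing a great performance.',
          'The information and therefor the data in the text are good enough to fulfill the task.',
          'Data in this way results in the best information and thus performance.']

# One count per observation, however many observations there are
counts = [count_top_words(sentence) for sentence in column]
print(counts)  # [2, 4, 4]
```

If the observations are stored in a pandas Series rather than a list, something like `column.apply(count_top_words)` should give the same counts, though that depends on how the data is loaded.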

Ben
  • When I try this code, I receive a count of [0, 0, 0] again. Besides that, my actual dataset has around 10,000 observations. Does this mean I should make a *count =* for those 10,000 observations? – Paul Engelbert Dec 06 '21 at 10:58
  • @PaulEngelbert I just used the 3 observations you mentioned. You should adapt it to your own code rather than copying it without checking – Ben Dec 06 '21 at 11:10
  • I understand, of course, but do you know a way to make the count work for 10k observations rather than copying the count = [0, 0, 0] over and over again? – Paul Engelbert Dec 06 '21 at 11:12
  • @PaulEngelbert Fine, I just changed it – Ben Dec 06 '21 at 11:19
  • It worked! Thank you very much! – Paul Engelbert Dec 06 '21 at 11:57